CONSISTENCY OF SUPPORT VECTOR MACHINES FOR
FORECASTING THE EVOLUTION OF AN UNKNOWN ERGODIC
DYNAMICAL SYSTEM FROM OBSERVATIONS WITH UNKNOWN
NOISE



By Ingo Steinwart and Marian Anghel 

Los Alamos National Laboratory 

We consider the problem of forecasting the next (observable)
state of an unknown ergodic dynamical system from a noisy observation
of the present state. Our main result shows, for example, that
support vector machines (SVMs) using Gaussian RBF kernels can
learn the best forecaster from a sequence of noisy observations if
(a) the unknown observational noise process is bounded and has a
summable $\alpha$-mixing rate and (b) the unknown ergodic dynamical system
is defined by a Lipschitz continuous function on some compact
subset of $\mathbb{R}^d$ and has a summable decay of correlations for Lipschitz
continuous functions. In order to prove this result we first establish a
general consistency result for SVMs and all stochastic processes that
satisfy a mixing notion that is substantially weaker than $\alpha$-mixing.



Let us assume that we have an ergodic dynamical system described by the
sequence $(F^n)_{n\ge 0}$ of iterates of an (essentially) unknown map $F : M \to M$,
where $M \subset \mathbb{R}^d$ is compact and the corresponding ergodic measure $\mu$ is
assumed to be unique. Furthermore, assume that all observations $\tilde x$ of this
dynamical system are corrupted by some stationary, $\mathbb{R}^d$-valued, additive
noise process $\mathcal{E} = (\varepsilon_n)_{n\ge 0}$ whose distribution $\nu$ we assume to be
independent of the state, but otherwise unknown, too. In other words, all possible
observations of the system at time $n \ge 0$ are of the form

(1)    $\tilde x_n := F^n(x_0) + \varepsilon_n$,

where $x_0$ is a true but unknown state at time 0. Now, given an observation of
the system at some arbitrary time, our goal is to forecast the next observable
state (we will see later that under some circumstances this is equivalent to
forecasting the next true state), that is, given $x + \varepsilon$ we want to forecast
$F(x) + \varepsilon'$, where $\varepsilon$ and $\varepsilon'$ are the observational errors for $x$ and its successor
$F(x)$. Of course, if we know neither $F$ nor $\nu$, then this task is impossible,
and hence we assume that we have a finite sequence $T = (\tilde x_0, \dots, \tilde x_{n-1})$ of
noisy observations from a trajectory of the dynamical system, that is, all
$\tilde x_i$, $i = 0, \dots, n-1$, are given by (1) for a common initial state $x_0$. Now,
informally speaking, our goal is to use $T$ to build a forecaster $f : \mathbb{R}^d \to \mathbb{R}^d$
whose average forecasting error on future noisy observations is as
small as possible. In order to make this goal precise we need a loss
function $L : \mathbb{R}^d \to [0, \infty)$ such that

$L(F(x) + \varepsilon' - f(x + \varepsilon))$

gives a value for the discrepancy between the forecast $f(x + \varepsilon)$ and the
observed next state $F(x) + \varepsilon'$. In the following, we always assume implicitly
that small values of $L(F(x) + \varepsilon' - f(x + \varepsilon))$ correspond to small values of
$\|F(x) + \varepsilon' - f(x + \varepsilon)\|_2$, where $\|\cdot\|_2$ denotes the Euclidean distance in $\mathbb{R}^d$.
Now, by the stationarity of $\mathcal{E}$, the average forecasting performance is given
by the $L$-risk

(2)    $\mathcal{R}_{L,P}(f) := \iint L(F(x) + \varepsilon_1 - f(x + \varepsilon_0))\, \nu(d\varepsilon)\, \mu(dx)$,

where $\varepsilon = (\varepsilon_i)_{i\ge 0}$ and $P := \nu \otimes \mu$. Obviously, the smaller the risk the better
the forecaster is, and hence we ideally would like to have a forecaster
$f^*_{L,P} : \mathbb{R}^d \to \mathbb{R}^d$ that attains the minimal $L$-risk

(3)    $\mathcal{R}^*_{L,P} := \inf\{\mathcal{R}_{L,P}(f) \mid f : \mathbb{R}^d \to \mathbb{R}^d \text{ measurable}\}$.

Now assume that we have a method $\mathcal{L}$ that assigns to every training set $T$
a forecaster $f_T$. Then the method $\mathcal{L}$ achieves our goal asymptotically if it
is consistent in the sense of

(4)    $\lim_{n\to\infty} \mathcal{R}_{L,P}(f_T) = \mathcal{R}^*_{L,P}$,

where the limit is in probability $P$.

To the best of our knowledge, the forecasting problem described by (1)-(4)
has not been considered in the literature, and even the observational noise
model itself has only been considered sporadically, though it clearly
"captures important features of many experimental situations" [27]. Moreover,
most of the existing work on the observational noise model deals with the
question of denoising [17, 23, 24, 25, 26, 27, 35]. In particular, [25, 26, 27]
provide both positive and negative results on the existence of consistent
denoising procedures.

In [32] a related forecasting goal is considered for the least squares loss
and stochastic processes of the form $Z_{i+1} := F(Z_i) + \varepsilon_{i+1}$, $i \ge 0$, where $(F^i)_{i\ge 0}$
is a dynamical system and $(\varepsilon_i)$ is some additive and centered i.i.d. dynamical
noise. In particular, consistency of two histogram-based methods is established
if (a) $F : M \to M$ is continuous and $(\varepsilon_i)$ is bounded, or (b) $F$ is
bounded and $\varepsilon_i$ is absolutely continuous. Note that the first case shows that
in the absence of dynamical and observational noise there is a method which
can learn to identify $F$ whenever it is continuous but otherwise unknown.
However, it is unclear how to extend the methods of [32] to deal with
observational noise.

Variants of the forecasting problem for general stationary ergodic processes
$(Z_i)$ have been extensively studied in the literature. One often considered
variant is static autoregression (see [22], page 569, and the references
therein) where the goal is to find sequences $f_m(Z_{-1}, \dots, Z_{-m})$ of estimators
that converge almost surely to $\mathbb{E}(Z_0 \mid Z_{-1}, \dots, Z_{-\infty})$, which is known to be
the least squares optimal one-step-ahead forecaster using an infinite past of
observations. However, even if forecasters using a longer history of observations
are considered in (2)-(4), the goal of static autoregression cannot be
compared to our concept of consistency. Indeed, in static autoregression the
goal is to find a near-optimal prediction for $\tilde x_0$ using the previously observed
$\tilde x_{-1}, \dots, \tilde x_{-m}$ of the same trajectory, whereas our goal is to use the
observations to build a predictor which predicts near-optimally for arbitrary future
observations. In machine learning terminology, static autoregression is thus
an "on-line" learning problem whereas our notion of consistency defines a
"batch" learning problem.

Learning methods for estimating $\mathbb{E}(Z_0 \mid Z_{-1}, \dots, Z_{-\infty})$ in a sense similar
to (4) are considered by, for example, [29, 30]; unfortunately these methods
require $\alpha$- or $\beta$-mixing conditions for $(Z_i)$ that cannot be satisfied by
nontrivial dynamical systems. Finally, a result by Nobel [31] shows that there
is no method that is universally consistent for classification and regression
problems where the data is generated by an arbitrary stationary ergodic
process $(Z_i)$. In particular this result shows that our general consistency
Theorem 2.4 cannot be extended to such $(Z_i)$.

If the observational noise process $\mathcal{E}$ is mixing in the ergodic sense, then it
is not hard to check that the process described by (1) is ergodic and hence
it satisfies a strong law of large numbers by Birkhoff's theorem. Using the
recent results in [39], we then see that there exists a support vector machine
(see the next section for a description) depending on $F$ and $\mathcal{E}$ which
is consistent in the sense of (4). However, [39] does not provide an explicit
method for finding a consistent SVM even if both $F$ and $\mathcal{E}$ are known.
Consequently, it is fair to say that though SVMs have no limitations in principle
for the forecasting problem described by (1)-(4), there is currently
no theoretically sound way to use them. The goal of this work is to address
this issue by showing that certain SVMs are consistent for all pairs $(F, \mathcal{E})$ of
Lipschitz continuous $F$ and bounded $\mathcal{E}$ that have a sufficiently fast decay of
correlations for Lipschitz continuous functions. In particular, we show that
these SVMs are consistent for all uniformly smooth expanding or hyperbolic
dynamics $F$ and all bounded i.i.d. noise processes $\mathcal{E}$.

The rest of this work is organized as follows: In Section 1 we recall the 
definition of support vector machines (SVMs). Then, in Section 2, we present 
a consistency result for SVMs and general stochastic processes that have 
a sufficiently fast decay of correlations. This result is then applied to the 
above forecasting problem in Section 3, where we also briefly review some 
dynamical systems with a sufficiently fast decay of correlations. Possible 
future extensions of this work are discussed in Section 4. Finally, the proofs 
of the two main results can be found in Sections 5 and 6, respectively. 

1. Support vector machines. The goal of this section is to briefly describe 
support vector machines, which were first introduced by [7, 15] as a method 
for learning binary classification tasks. Since then, they have been general- 
ized to other problem domains such as regression and anomaly detection, 
and nowadays they are considered to be one of the state-of-the-art machine 
learning methods for these problem domains. For a thorough introduction 
to SVMs, we refer the reader to the books [16, 36, 42]. 

Let us begin by introducing some notation related to SVMs. To this end,
let us fix two nonempty closed sets $X \subset \mathbb{R}^d$ and $Y \subset \mathbb{R}$, and a measurable
function $L : X \times Y \times \mathbb{R} \to [0, \infty)$, which in the following is called a loss function
(note that this is a more general concept of a loss function than the informal
notion of a loss function used in the introduction). For a finite sequence
$T = ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$ and a function $f : X \to \mathbb{R}$, we define
the empirical $L$-risk by

$\mathcal{R}_{L,T}(f) := \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, f(x_i))$.

Moreover, for a distribution $P$ on $X \times Y$, we write

$\mathcal{R}_{L,P}(f) := \int_{X \times Y} L(x, y, f(x))\, dP(x, y)$

and $\mathcal{R}^*_{L,P} := \inf\{\mathcal{R}_{L,P}(f) \mid f : X \to \mathbb{R} \text{ measurable}\}$ for the $L$-risk and
minimal $L$-risk associated to $P$. Now, let $H$ be the reproducing kernel Hilbert
space (RKHS) of a measurable kernel $k : X \times X \to \mathbb{R}$ (see [1] for a general
theory of such spaces). Given a finite sequence $T \in (X \times Y)^n$ and a
regularization parameter $\lambda > 0$, support vector machines construct a function
$f_{T,\lambda,H} : X \to \mathbb{R}$ satisfying

(5)    $\lambda \|f_{T,\lambda,H}\|_H^2 + \mathcal{R}_{L,T}(f_{T,\lambda,H}) = \inf_{f \in H} \bigl( \lambda \|f\|_H^2 + \mathcal{R}_{L,T}(f) \bigr)$.
In the following we are mainly interested in the commonly used Gaussian
RBF kernels $k_\sigma : X \times X \to \mathbb{R}$ defined by

$k_\sigma(x, x') := \exp(-\sigma^2 \|x - x'\|_2^2)$,    $x, x' \in X$,

where $X \subset \mathbb{R}^d$ is a nonempty subset and $\sigma > 0$ is a free parameter called the
width. We write $H_\sigma(X)$ for the corresponding RKHSs, which are described
in some detail in [40]. Finally, for SVMs using a Gaussian kernel $k_\sigma$, we write
$f_{T,\lambda,\sigma} := f_{T,\lambda,H_\sigma(X)}$ in order to simplify notation.
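
For the least squares loss $L(x, y, t) = (y - t)^2$, problem (5) with a Gaussian kernel can be solved in closed form, which makes it a convenient illustration. The following Python sketch assumes NumPy and relies on the representer theorem, by which the minimizer has the form $f = \sum_i \alpha_i k_\sigma(x_i, \cdot)$ with $(K + n\lambda I)\alpha = y$ for the Gram matrix $K$. The helper names `gaussian_kernel`, `fit_ls_svm`, `predict` and `objective` are hypothetical and introduced only for this illustration; they are not taken from the paper or from any library.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gram matrix of k_sigma(x, x') = exp(-sigma^2 * ||x - x'||_2^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sigma**2 * np.maximum(sq, 0.0))

def fit_ls_svm(X, y, lam, sigma):
    """Coefficients alpha of the minimizer of lam*||f||_H^2 + (1/n) sum (y_i - f(x_i))^2."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    # First-order condition of the objective: (K + n*lam*I) alpha = y.
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(X_train, alpha, X_new, sigma):
    """Evaluate f = sum_i alpha_i k_sigma(x_i, .) at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

def objective(alpha, K, y, lam):
    """Regularized empirical risk of f = sum_i alpha_i k(x_i, .) for the least squares loss."""
    f = K @ alpha
    return lam * alpha @ K @ alpha + np.mean((y - f) ** 2)
```

For other convex losses no such closed form is available in general, and an iterative solver has to be used instead.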

It is well known that if $L$ is a convex loss function in the sense that
$L(x, y, \cdot) : \mathbb{R} \to [0, \infty)$ is convex for all $(x, y) \in X \times Y$, then there exists a
unique $f_{T,\lambda,H}$. Moreover, in this case (5) becomes a strictly convex
optimization problem which can be solved by, for example, simple gradient
descent algorithms. However, for specific losses, including the least squares
loss, other more efficient algorithmic approaches are used in practice; see
[36, 41, 42, 43]. Let us now introduce some additional properties of loss
functions:

Definition 1.1. A loss function $L : X \times Y \times \mathbb{R} \to [0, \infty)$ is called:

(i) Differentiable if $L(x, y, \cdot) : \mathbb{R} \to [0, \infty)$ is differentiable for all
$(x, y) \in X \times Y$. In this case the derivative is denoted by $L'(x, y, \cdot)$.

(ii) Locally Lipschitz continuous if for all $a > 0$ there exists a constant
$c_a > 0$ such that for all $x \in X$, $y \in Y$ and all $t, t' \in [-a, a]$ we have

$|L(x, y, t) - L(x, y, t')| \le c_a |t - t'|$.

In this case the smallest possible constant $c_a$ is denoted by $|L|_{a,1}$.

(iii) Lipschitz continuous if $|L|_1 := \sup_{a > 0} |L|_{a,1} < \infty$.

With the help of these definitions we can now summarize assumptions on
the loss function $L$ that we will use frequently.

Assumption L. The loss $L : X \times Y \times \mathbb{R} \to [0, \infty)$ is convex, differentiable
and locally Lipschitz continuous in the above sense, and it also satisfies
$L(x, y, 0) \le 1$ for all $(x, y) \in X \times Y$. Moreover, for the derivative $L'$ there
exists a constant $c \in [0, \infty)$ such that for all $(x, y, t), (x', y', t') \in X \times Y \times \mathbb{R}$
we have $|L'(x, y, 0)| \le c$ and

(6)    $|L'(x, y, t) - L'(x', y', t')| \le c \|(x, y, t) - (x', y', t')\|_2$.

Note that combining the two assumptions on $L'$ yields $|L'(x, y, t)| \le c(1 +
|t|)$ for all $(x, y, t) \in X \times Y \times \mathbb{R}$, and from this it is not hard to conclude that
$|L|_{a,1} \le c(1 + a)$ for all $a > 0$.
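
The two bounds stated in the preceding note can be spelled out as follows. This is a short worked derivation added for the reader's convenience; it uses only Assumption L and the fact that $L'(x, y, \cdot)$ is continuous, which follows from (6).

```latex
\begin{align*}
|L'(x,y,t)| &\le |L'(x,y,0)| + |L'(x,y,t) - L'(x,y,0)| \\
            &\le c + c\,\|(x,y,t) - (x,y,0)\|_2 \;=\; c\,(1 + |t|), \\[4pt]
|L(x,y,t) - L(x,y,t')| &= \Bigl|\int_{t'}^{t} L'(x,y,s)\,ds\Bigr|
   \;\le\; \sup_{|s|\le a} |L'(x,y,s)| \cdot |t - t'|
   \;\le\; c\,(1+a)\,|t-t'|, \qquad t,t' \in [-a,a].
\end{align*}
```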

Since Assumption L is rather complex, let us now illustrate it for two
particular classes of loss functions used in many SVM variants.

Example 1.2. A loss $L : X \times Y \times \mathbb{R} \to [0, \infty)$ of the form $L(x, y, t) =
\varphi(yt)$ for a suitable function $\varphi : \mathbb{R} \to \mathbb{R}$ and all $x \in X$, $y \in Y := \{-1, 1\}$ and
$t \in \mathbb{R}$, is called margin-based. Obviously, $L$ is convex, continuous, (locally)
Lipschitz continuous or differentiable if and only if $\varphi$ is. In addition,
convexity of $L$ implies local Lipschitz continuity of $L$. Furthermore, recall that
[6] showed that $L$ is suitable for binary classification tasks if and only if $\varphi$
is differentiable at 0 with $\varphi'(0) < 0$.

Let us now consider Assumption L. Obviously, the first part is satisfied
if and only if $\varphi$ is convex and differentiable, and also satisfies $\varphi(0) \le 1$.
Note that the latter can always be ensured by rescaling $\varphi$. Furthermore,
we have $L'(x, y, t) = y \varphi'(yt)$ and by considering the cases $y = y'$ and $y \ne y'$
separately we see that (6) is satisfied if and only if $\varphi'$ is Lipschitz continuous
and satisfies

$|\varphi'(t) + \varphi'(t')| \le c(1 + |t + t'|)$,    $t, t' \in \mathbb{R}$,

for a constant $c > 0$. Finally, the condition $|L'(x, y, 0)| = |\varphi'(0)| \le c$ is
always satisfied for sufficiently large $c$. From these considerations we conclude
that the classical SVM losses $\varphi(t) = (1 - t)_+$ and $\varphi(t) = (1 - t)_+^2$, where
$(x)_+ := \max\{0, x\}$, do not satisfy Assumption L, whereas the least squares
loss and the logistic loss defined by $\varphi(t) = (1 - t)^2$ and $\varphi(t) = \ln(1 + \exp(-t))$,
respectively, fulfill Assumption L.
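
The displayed growth condition on $\varphi'$ is easy to probe numerically along the line $t' = -t$, where its right-hand side reduces to the constant $c$. The following Python sketch (the function names are hypothetical and chosen for this illustration) evaluates $|\varphi'(s) + \varphi'(-s)|$ for the squared hinge, least squares and logistic losses. It only illustrates why the squared hinge loss violates the condition; it is not a proof for the other two losses.

```python
import math

def d_squared_hinge(t):
    """Derivative of phi(t) = max(0, 1 - t)**2."""
    return -2.0 * max(0.0, 1.0 - t)

def d_least_squares(t):
    """Derivative of phi(t) = (1 - t)**2."""
    return -2.0 * (1.0 - t)

def d_logistic(t):
    """Derivative of phi(t) = ln(1 + exp(-t)), arranged to avoid overflow."""
    if t >= 0:
        e = math.exp(-t)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(t))

def antidiagonal_gap(dphi, s):
    """|phi'(s) + phi'(-s)|; on t' = -t the bound c(1 + |t + t'|) equals c."""
    return abs(dphi(s) + dphi(-s))

if __name__ == "__main__":
    for s in (1.0, 10.0, 100.0, 1000.0):
        print(s,
              antidiagonal_gap(d_squared_hinge, s),
              antidiagonal_gap(d_least_squares, s),
              antidiagonal_gap(d_logistic, s))
```

The first column grows linearly in $s$, so no constant $c$ can bound it, whereas the other two columns stay constant.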

Example 1.3. A loss $L : X \times Y \times \mathbb{R} \to [0, \infty)$ of the form $L(x, y, t) = \psi(y -
t)$ for a suitable function $\psi : \mathbb{R} \to \mathbb{R}$ and all $x \in X$, $y \in Y \subset \mathbb{R}$ and $t \in \mathbb{R}$,
is called distance-based. Recall that distance-based losses such as the least
squares loss $\psi(r) = r^2$, Huber's insensitive loss $\psi(r) = \min\{r^2, \max\{1, 2|r| -
1\}\}$, the logistic loss $\psi(r) = \ln((1 + e^r)^2 e^{-r}) - \ln 4$ or the $\epsilon$-insensitive loss
$\psi(r) = (|r| - \epsilon)_+$ are usually used for regression.

In order to consider Assumption L we assume that $Y$ is a compact subset
of $\mathbb{R}$. Then it is easy to see that the first part of Assumption L is satisfied if
and only if $\psi$ is convex and differentiable, and also satisfies $\sup_{y \in Y} \psi(y) \le 1$.
Note that the latter can always be ensured by rescaling $\psi$ since the convexity
of $\psi$ implies its continuity. Furthermore, we have $L'(x, y, t) = -\psi'(y - t)$, and
hence we see that (6) is satisfied if and only if $\psi'$ is Lipschitz continuous.
Finally, every convex and differentiable function is continuously differentiable
and hence we can always ensure $|L'(x, y, 0)| = |\psi'(y)| \le c$. From these
considerations we immediately see that all of the above distance-based losses
besides the $\epsilon$-insensitive loss satisfy Assumption L.

2. Consistency of SVMs for a class of stochastic processes. The goal 
of this section is to establish consistency of SVMs for a class of stochastic 
processes having a uniform decay of correlations for Lipschitz continuous 
functions. This result will then be used to establish consistency of SVMs for 
the forecasting problem and suitable combinations of dynamical systems $F$
and noise processes $\mathcal{E}$.

Let us begin with some notation. To this end, let us assume that we have
a probability space $(\Omega, \mathcal{A}, \mu)$, a measurable space $(Z, \mathcal{B})$ and a measurable
map $T : \Omega \to Z$. Then $\sigma(T)$ denotes the smallest $\sigma$-algebra on $\Omega$ for which
$T$ is measurable. Moreover, $\mu_T$ denotes the $T$-image measure of $\mu$, which is
defined by $\mu_T(B) := \mu(T^{-1}(B))$, $B \subset Z$ measurable. Recall that a stochastic
process $\mathcal{Z} := (Z_n)_{n\ge 0}$, that is, a sequence of measurable maps $Z_n : \Omega \to Z$,
$n \ge 0$, is called identically distributed if $\mu_{Z_n} = \mu_{Z_m}$ for all $n, m \ge 0$. In this
case we write $P := \mu_{Z_0}$ in the following. Moreover, $\mathcal{Z}$ is called second-order
stationary if $\mu_{(Z_{i_1+i}, Z_{i_2+i})} = \mu_{(Z_{i_1}, Z_{i_2})}$ for all $i_1, i_2, i \ge 1$, and it is said to be
stationary if $\mu_{(Z_{i_1+i}, \dots, Z_{i_n+i})} = \mu_{(Z_{i_1}, \dots, Z_{i_n})}$ for all $n, i, i_1, \dots, i_n \ge 1$.

The following definition introduces the correlation sequence for stochastic 
processes that will be used throughout this work. 

Definition 2.1. Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $(Z, \mathcal{B})$ be a
measurable space, $\mathcal{Z}$ be a $Z$-valued, identically distributed process on $\Omega$ and
$P := \mu_{Z_0}$. Then for $\psi, \varphi \in L_2(P)$ the $n$th correlation, $n \ge 0$, is defined by

$\mathrm{cor}_{\mathcal{Z},n}(\psi, \varphi) := \int_\Omega \psi(Z_0) \cdot \varphi(Z_n)\, d\mu - \int_Z \psi\, dP \cdot \int_Z \varphi\, dP$.

Obviously, if $\mathcal{Z}$ is an i.i.d. process, we have $\mathrm{cor}_{\mathcal{Z},n}(\psi, \varphi) = 0$ for all $\psi, \varphi \in
L_2(P)$ and $n > 0$, and this remains true if $\psi \circ Z_0$ and $\varphi \circ Z_n$ are only
uncorrelated. Consequently, if $\lim_{n\to\infty} \mathrm{cor}_{\mathcal{Z},n}(\psi, \varphi) = 0$, the corresponding speed
of convergence provides information about how fast $\psi \circ Z_0$ becomes
uncorrelated from $\varphi \circ Z_n$. This idea has been extensively used in the statistical
literature in terms of, for example, the $\alpha$-mixing coefficients

$\alpha(\mathcal{Z}, n) := \sup_{A \in \mathcal{F}_{-\infty}^{0},\, B \in \mathcal{F}_{n}^{\infty}} |\mu(A \cap B) - \mu(A)\mu(B)|$,

where $\mathcal{F}_i^j$ is the initial $\sigma$-algebra of $Z_i, \dots, Z_j$. These and related (stronger)
coefficients together with examples including, for example, certain Markov
chains, ARMA processes, and GARCH processes are discussed in detail in
the survey article [10] and the books [8, 11, 21]. Moreover, for processes
$\mathcal{Z}$ satisfying $\alpha(\mathcal{Z}, n) \le c n^{-\alpha}$ for some constant $c > 0$ and all $n \ge 1$ it was
recently described in [39] how to find a regularization sequence $(\lambda_n)$ for
which the corresponding SVM is consistent. Unfortunately, however, it is
well known that every nontrivial ergodic dynamical system is not $\alpha$-mixing,
that is, it does not satisfy $\lim_{n\to\infty} \alpha(\mathcal{Z}, n) = 0$, and therefore the result of
[39] cannot be used to investigate consistency for the forecasting problem.
On the other hand, various dynamical systems enjoy a uniform decay rate
over smaller sets of functions such as Lipschitz continuous functions (see
Section 3 for some examples). This leads to the following definition:
Definition 2.2. Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z \subset \mathbb{R}^d$ be a
compact set, $\mathcal{Z}$ be a $Z$-valued, identically distributed process on $\Omega$ and $P := \mu_{Z_0}$.
Moreover, let $(\gamma_i)_{i\ge 0}$ be a strictly positive sequence converging to 0. Then $\mathcal{Z}$
is said to have a decay of correlations of the order $(\gamma_i)$ if for all $\psi, \varphi \in \mathrm{Lip}(Z)$
there exists a constant $\kappa_{\psi,\varphi} \in [0, \infty)$ such that

(7)    $|\mathrm{cor}_{\mathcal{Z},i}(\psi, \varphi)| \le \kappa_{\psi,\varphi}\, \gamma_i$,    $i \ge 0$,

where $\mathrm{Lip}(Z)$ denotes the set of all Lipschitz continuous $f : Z \to \mathbb{R}$.

Recall (see, e.g., Theorem 4.13 in Vol. 3 of [11]) that for every $Z$-valued,
identically distributed process $\mathcal{Z}$ and all bounded functions $\psi, \varphi : Z \to \mathbb{R}$ we
have

$|\mathrm{cor}_{\mathcal{Z},i}(\psi, \varphi)| \le 2\pi \|\psi\|_\infty \|\varphi\|_\infty\, \alpha(\mathcal{Z}, i)$,    $i \ge 1$.

Since Lipschitz continuous functions on compacta are bounded, we hence see
that $\alpha$-mixing processes have a decay of correlations of the order $(\alpha(\mathcal{Z}, i))$.
In Section 3 we will present some examples of dynamical systems that are
not $\alpha$-mixing but have a nontrivial decay of correlations.
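
To make the correlation sequence concrete, the following Python sketch estimates the $n$th correlation of two functions along a single simulated orbit. It rests on several assumptions that are not part of the paper: the logistic map $x \mapsto 4x(1-x)$ on $[0, 1]$ is used as a stand-in dynamical system, the integrals with respect to the invariant measure are replaced by time averages (which is justified only by ergodicity), and floating-point orbits merely approximate true orbits. The names `logistic_orbit` and `empirical_correlation` are hypothetical.

```python
import numpy as np

def logistic_orbit(x0, n):
    """Orbit x_{i+1} = 4 x_i (1 - x_i) of the logistic map on [0, 1]."""
    xs = np.empty(n)
    x = x0
    for i in range(n):
        xs[i] = x
        x = 4.0 * x * (1.0 - x)
    return xs

def empirical_correlation(xs, psi, phi, lag):
    """Time-average estimate of the correlation of psi(Z_0) and phi(Z_lag)."""
    a = psi(xs)
    b = phi(xs)
    m = len(xs) - lag
    # Average of the products minus the product of the averages.
    return np.mean(a[:m] * b[lag:lag + m]) - np.mean(a) * np.mean(b)

def identity(x):
    return x

orbit = logistic_orbit(0.1234, 100_000)

if __name__ == "__main__":
    for lag in range(4):
        print(lag, empirical_correlation(orbit, identity, identity, lag))
```

Such an estimate can suggest how quickly correlations decay for particular observables, but it cannot certify a uniform rate over all Lipschitz functions as required by (7).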

Let us now summarize our assumptions on the process Z which we will 
make in the rest of this section. 



Assumption Z. The process $\mathcal{Z} = (X_i, Y_i)_{i\ge 0}$ is defined on the probability
space $(\Omega, \mathcal{A}, \mu)$ and is $X \times Y$-valued, where $X \subset \mathbb{R}^d$ and $Y \subset \mathbb{R}$ are
compact subsets. Moreover, $\mathcal{Z}$ is second-order stationary.

Finally, we will need the following mutually exclusive assumptions on the 
regularization sequence and the kernel width of SVMs: 

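Assumption S1. For a fixed strictly positive sequence $(\gamma_i)_{i\ge 0}$ converging
to 0 and a locally Lipschitz continuous loss $L$ the monotone sequences
$(\lambda_n) \subset (0, 1]$ and $(\sigma_n) \subset [1, \infty)$ satisfy $\lim_{n\to\infty} \lambda_n = 0$,
$\sup_{n\ge 1} e^{-\sigma_n} |L|_{\lambda_n^{-1/2},1} < \infty$,

$\sup_{n\ge 1} \frac{\lambda_n \sigma_n^{4d}}{|L|_{\lambda_n^{-1/2},1}} < \infty$    and    $\lim_{n\to\infty} \frac{|L|^3_{\lambda_n^{-1/2},1}\, \sigma_n^2}{n \lambda_n^4} \sum_{i=0}^{n-1} \gamma_i = 0$.

Assumption S2. For a fixed strictly positive sequence $(\gamma_i)_{i\ge 0}$ converging
to 0 and a locally Lipschitz continuous loss $L$ the sequences $(\lambda_n) \subset (0, 1]$
and $(\sigma_n) \subset [1, \infty)$ satisfy $\lim_{n\to\infty} \lambda_n \sigma_n^d = 0$,

$\lim_{n\to\infty} \frac{\lambda_n \sigma_n^{4d}}{|L|_{\lambda_n^{-1/2},1}} = \infty$    and    $\lim_{n\to\infty} \frac{\sigma_n^{2+12d}}{n \lambda_n} \sum_{i=0}^{n-1} \gamma_i = 0$.

Assumption S3. For a fixed strictly positive sequence $(\gamma_i)_{i\ge 0}$ converging
to 0 and a locally Lipschitz continuous loss $L$ the monotone sequences
$(\lambda_n) \subset (0, 1]$ and $(\sigma_n) \subset [1, \infty)$ satisfy $\lim_{n\to\infty} \lambda_n = 0$,
$\lim_{n\to\infty} e^{-\sigma_n} |L|_{\lambda_n^{-1/2},1} = \infty$,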

X^a^d \L\\-i/2 n-i 

sup I " — < oo and lim — ^. 7^ = 0. 

n>i |L|,--i/2^^ — nXi ^ 

Remark 2.3. In order to illustrate the Assumptions S1, S2 and S3, let
us assume for simplicity that $L$ is Lipschitz continuous; the case that $L$ is
the least squares loss will be considered in Remark 3.4. Now note that for
Lipschitz continuous losses Assumption S3 cannot be satisfied and hence it
suffices to consider Assumptions S1 and S2.

Let us first assume $\sum_{i\ge 0} \gamma_i < \infty$ as well as $\lambda_n := n^{-\alpha}$ and $\sigma_n := n^{\beta}$ for
$n \ge 1$ and constants $\alpha > 0$ and $\beta \ge 0$. Then Assumption S1 is met if $\alpha > 4d\beta$
and $4\alpha + 2\beta < 1$, whereas Assumption S2 is met if $d\beta < \alpha < 4d\beta$ and $\alpha +
(2 + 12d)\beta < 1$. In particular, for $\beta = 0$ Assumption S1 is met if $0 < \alpha < 1/4$,
whereas Assumption S2 cannot be met in this case.

Finally, we consider a milder assumption on the decay of correlations,
namely $n^{-1} \sum_{i=0}^{n-1} \gamma_i \le c(1 + \ln n)^{-1}$, for a constant $c > 0$ and all $n \ge 1$.
Obviously, this is satisfied if we assume that $(\gamma_i)$ has some arbitrary
polynomial decay. Let us consider the sequences $\lambda_n := (1 + \ln n)^{-\alpha}$ and $\sigma_n :=
(1 + \ln n)^{\beta}$ for $n \ge 1$ and $\alpha > 0$ and $\beta \ge 0$. Then Assumption S1 is met if
$\alpha > 4d\beta$ and $4\alpha + 2\beta < 1$, whereas Assumption S2 is met if $d\beta < \alpha < 4d\beta$
and $\alpha + (2 + 12d)\beta < 1$. In particular, for $\beta = 0$ Assumption S1 is met if
$0 < \alpha < 1/4$, while Assumption S2 cannot be met.

The illustrations above show that both Assumptions S1 and S2 consist of
two contrary conditions, namely one which implies that $\lambda_n$ tends to 0 and
another one which ensures that this convergence is not too fast. Roughly speaking,
the first condition guarantees that the approximation error tends to zero
(see Lemma 5.4), but since this simultaneously means that the statistical
error becomes larger, the second condition is needed to ensure that the
latter error still tends to zero (see the proof of Theorem 2.4). This trade-off
between approximation and statistical error is typical for consistent learning
algorithms (see the books [19] and [22] for several such examples).
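
The exponent conditions of Remark 2.3 for a Lipschitz continuous loss are simple enough to check mechanically. The following Python sketch encodes them for sequences $\lambda_n = n^{-\alpha}$, $\sigma_n = n^{\beta}$ (or their logarithmic counterparts, for which the remark states the same conditions). The function name `lipschitz_case` is hypothetical, and the conditions are only the sufficient ones quoted in the remark: a result of `False` means that the remark does not certify the assumption, not that the assumption necessarily fails.

```python
def lipschitz_case(d, alpha, beta):
    """Return which of Assumptions S1 and S2 are certified by Remark 2.3.

    d     : input dimension (d >= 1)
    alpha : decay exponent of the regularization sequence (alpha > 0)
    beta  : growth exponent of the kernel width sequence (beta >= 0)
    """
    if d < 1 or alpha <= 0 or beta < 0:
        raise ValueError("need d >= 1, alpha > 0 and beta >= 0")
    s1 = alpha > 4 * d * beta and 4 * alpha + 2 * beta < 1
    s2 = d * beta < alpha < 4 * d * beta and alpha + (2 + 12 * d) * beta < 1
    return {"S1": s1, "S2": s2}

if __name__ == "__main__":
    print(lipschitz_case(d=1, alpha=0.2, beta=0.0))
    print(lipschitz_case(d=1, alpha=0.05, beta=0.02))
```

Since S1 requires $\alpha > 4d\beta$ and S2 requires $\alpha < 4d\beta$, at most one of the two entries can be `True`, in line with the assumptions being mutually exclusive.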

With the help of these assumptions we can now establish the announced 
consistency of SVMs. 

Theorem 2.4. Let $\mathcal{Z} = (X_i, Y_i)_{i\ge 0}$ be a stochastic process satisfying
Assumption Z. We write $P := \mu_{(X_0, Y_0)}$ and assume that $\mathcal{Z}$ has a decay of
correlations of some order $(\gamma_i)$. In addition, let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be
a loss satisfying Assumption L. Then for all sequences $(\lambda_n) \subset (0, 1]$ and
$(\sigma_n) \subset [1, \infty)$ satisfying Assumptions S1, S2 or S3 and all $\epsilon \in (0, 1]$ we have

$\lim_{n\to\infty} \mu\bigl(\omega \in \Omega : |\mathcal{R}_{L,P}(f_{T_n(\omega),\lambda_n,\sigma_n}) - \mathcal{R}^*_{L,P}| > \epsilon\bigr) = 0$,

where $T_n(\omega) := ((X_0(\omega), Y_0(\omega)), \dots, (X_{n-1}(\omega), Y_{n-1}(\omega)))$ and $f_{T_n(\omega),\lambda_n,\sigma_n}$ is
the SVM forecaster defined by (5).

Theorem 2.4 in particular applies to stochastic processes that are
$\alpha$-mixing with rate $(\gamma_i)$. However, the Assumptions S1, S2 and S3 ensuring
consistency are substantially stronger than the ones obtained in [39] for
such processes. On the other hand, there are interesting stochastic processes
that are not $\alpha$-mixing but still enjoy a reasonably fast decay of correlations.
Since we are mainly interested in the forecasting problem we will delay the
discussion of such examples to the next section.

3. Consistency of SVMs for the forecasting problem. In this section we 
present our main result, which establishes the consistency of SVMs for the 
forecasting problem described by (1)-(4) if the dynamical system enjoys a
certain decay of correlations. In addition, we discuss some examples of such 
systems. 

We begin by first revisiting our informal problem description given in the
introduction. To this end, let $M \subset \mathbb{R}^d$ be a compact set and $F : M \to M$ be a
map such that the dynamical system $\mathcal{D} := (F^i)_{i\ge 0}$ has a unique ergodic
measure $\mu$. Moreover, let $\mathcal{E} = (\varepsilon_i)_{i\ge 0}$ be an $\mathbb{R}^d$-valued stochastic process which
is (stochastically) independent of $\mathcal{D}$. Then the process that generates the
noisy observations (1) is $(F^i + \varepsilon_i)_{i\ge 0}$. In particular, a sequence of
observations $(\tilde x_0, \dots, \tilde x_n)$ generated by this process is of the form (1) for a common
initial state. Now recall that, given an observation of the system at some
arbitrary time, our goal is to forecast the next observable state. Consequently,
we will use the training set

(8)    $T_n(x, \varepsilon) := ((\tilde x_0, \tilde x_1), \dots, (\tilde x_{n-1}, \tilde x_n))$
       $= ((x + \varepsilon_0, F(x) + \varepsilon_1), \dots, (F^{n-1}(x) + \varepsilon_{n-1}, F^n(x) + \varepsilon_n))$,

whose input/output pairs are consecutive observable states. Now note that a
single sample $(F^{i}(x) + \varepsilon_{i}, F^{i+1}(x) + \varepsilon_{i+1})$ depends on the pair $(\varepsilon_i, \varepsilon_{i+1})$ and
thus we have to consider the process of such pairs. The following assumption
summarizes the needed requirements of the process $\mathcal{N} := ((\varepsilon_i, \varepsilon_{i+1}))_{i\ge 0}$.

Assumption N. For the $\mathbb{R}^{2d}$-valued stochastic process $\mathcal{N}$ there exist
a constant $B > 0$ and a probability measure $\nu$ on $[-B, B]^{d\mathbb{N}}$ such that
the coordinate process $\mathcal{E} := (\pi_0 \circ S^i)_{i\ge 0}$ is stationary with respect to $\nu$
and satisfies $\mathcal{N} = (\pi_0 \circ S^i, \pi_0 \circ S^{i+1})_{i\ge 0}$, where $S$ denotes the shift
operator $(x_i)_{i\ge 0} \mapsto (x_{i+1})_{i\ge 0}$ and $\pi_0$ denotes the projection $(x_i)_{i\ge 0} \mapsto x_0$.
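
The construction of the training set (8) can be sketched in a few lines of Python. The example below is an illustration under explicit assumptions that are not part of the paper: the map $F$ is taken to be the logistic map on $[0, 1]$ (so $d = 1$), the noise is i.i.d. uniform on $[-B, B]^d$, and the names `make_training_set` and `logistic_map` are hypothetical.

```python
import numpy as np

def make_training_set(F, x0, n, noise_bound, rng):
    """Build the n input/output pairs of (8) from one noisy trajectory.

    Inputs are F^i(x0) + eps_i and outputs are F^{i+1}(x0) + eps_{i+1}
    for i = 0, ..., n-1, with eps_i i.i.d. uniform on [-noise_bound, noise_bound]^d.
    """
    x = np.asarray(x0, dtype=float)
    d = x.shape[0]
    states = np.empty((n + 1, d))
    for i in range(n + 1):
        states[i] = x
        x = F(x)
    eps = rng.uniform(-noise_bound, noise_bound, size=(n + 1, d))
    obs = states + eps          # noisy observations as in (1)
    return obs[:-1], obs[1:]    # consecutive observations form the pairs

def logistic_map(x):
    """Stand-in for the unknown map F (an assumption of this sketch)."""
    return 4.0 * x * (1.0 - x)

X, Y = make_training_set(logistic_map, [0.3], n=500, noise_bound=0.05,
                         rng=np.random.default_rng(0))
```

Because the output of one pair is the input of the next, consecutive samples share a noise term, which is why the pair process $\mathcal{N}$ rather than $\mathcal{E}$ itself enters the assumptions.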
Before we state our main result we note that the input variable $x + \varepsilon$ and
the output variable $F(x) + \varepsilon'$ are $d$-dimensional vectors. Consequently, our
notion of a loss introduced in Section 1 needs a refinement which captures
the ideas of the introduction. To this end we state the following assumption:

Assumption LD. For the function $L : \mathbb{R}^d \to [0, \infty)$ there exists a
distance-based loss satisfying Assumption L such that its representing function
$\psi : \mathbb{R} \to [0, \infty)$ has a unique global minimum at 0 and satisfies

(9)    $L(r_1, \dots, r_d) = \psi(r_1) + \cdots + \psi(r_d)$,    $(r_1, \dots, r_d) \in \mathbb{R}^d$.

Obviously, if $L$ satisfies Assumption LD, then $L$ is a loss in the sense of
the introduction. Moreover note that the specific form (9) makes it possible
to consider the coordinates of the output variable separately. Consequently,
we will use the forecaster

(10)    $\bar f_{T,\lambda,\sigma} := (f_{T^{(1)},\lambda,\sigma}, \dots, f_{T^{(d)},\lambda,\sigma})$,

where $f_{T^{(j)},\lambda,\sigma}$ is the SVM solution obtained by considering the distance-based
loss defined by $\psi$ and $T^{(j)} := ((\tilde x_0, \pi_j(\tilde x_1)), \dots, (\tilde x_{n-1}, \pi_j(\tilde x_n)))$, which
is obtained by projecting the output variable of $T$ onto its $j$th coordinate via
the coordinate projection $\pi_j : \mathbb{R}^d \to \mathbb{R}$. In other words, we build the forecaster
$\bar f_{T,\lambda,\sigma}$ by training $d$ different SVMs on the training sets $T^{(1)}, \dots, T^{(d)}$.
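
The coordinate-wise construction (10) can be illustrated for the special case $\psi(r) = r^2$, where each of the $d$ SVMs has a closed-form solution. The following Python sketch assumes NumPy, input and output arrays `X` and `Y` of shape `(n, d)` holding the pairs of the training set, and hypothetical helper names `gaussian_gram`, `fit_coordinatewise` and `forecast`. It is a sketch for the least squares loss only, not a general SVM solver.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gram matrix of the Gaussian RBF kernel exp(-sigma^2 * ||a - b||_2^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sigma**2 * np.maximum(sq, 0.0))

def fit_coordinatewise(X, Y, lam, sigma):
    """Train one least squares SVM per output coordinate.

    Column j of the returned array holds the kernel expansion coefficients
    obtained from the training set whose outputs are the j-th coordinates of Y.
    """
    n, d_out = Y.shape
    system = gaussian_gram(X, X, sigma) + n * lam * np.eye(n)
    return np.column_stack([np.linalg.solve(system, Y[:, j]) for j in range(d_out)])

def forecast(X_train, coeffs, X_new, sigma):
    """Stack the d scalar forecasts into a vector-valued forecast."""
    return gaussian_gram(X_new, X_train, sigma) @ coeffs
```

All $d$ problems share the same kernel matrix, so in practice one would factorize it once; the explicit loop is kept here only to mirror the $d$ separate training sets.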

With the help of these preparations we can now present our main result, 
which establishes consistency for such a forecaster. 

Theorem 3.1. Let $M \subset \mathbb{R}^d$ be a compact set, $F : M \to M$ be a Lipschitz
continuous map such that the dynamical system $\mathcal{D} := (F^i)_{i\ge 0}$ has a unique
ergodic measure $\mu$, and $\mathcal{N}$ be a stochastic process satisfying Assumption N.
Assume that both processes $\mathcal{D}$ and $\mathcal{N}$ have a decay of correlations of the order
$(\gamma_i)$. Moreover, let $L : \mathbb{R}^d \to [0, \infty)$ be a function satisfying Assumption LD.
Then for all sequences $(\lambda_n) \subset (0, 1]$ and $(\sigma_n) \subset [1, \infty)$ satisfying Assumptions
S1, S2 or S3 and all $\epsilon \in (0, 1]$ we have

$\lim_{n\to\infty} \mu \otimes \nu\bigl((x, \varepsilon) \in M \times [-B, B]^{d\mathbb{N}} : |\mathcal{R}_{L,P}(\bar f_{T_n(x,\varepsilon),\lambda_n,\sigma_n}) - \mathcal{R}^*_{L,P}| > \epsilon\bigr) = 0$,

where $T_n(x, \varepsilon)$ is defined by (8) and the risks are given by (2) and (3).

Note that if $\mathcal{E}$ is an i.i.d. process, then $\mathcal{N}$ has a decay of correlations of
any order. Moreover, if $\mathcal{E}$ is $\alpha$-mixing with mixing rate $(\gamma_i)$, then $\mathcal{N}$ has a
decay of correlations of order $(\gamma_i)$. Finally, if $\mathcal{D}$ has a decay of correlations
$(\gamma_i')$ and $\mathcal{N}$ has a decay of correlations $(\gamma_i'')$, then they obviously both have
a decay of correlations $(\gamma_i)$, where $\gamma_i := \max\{\gamma_i', \gamma_i''\}$. In particular, noise
processes having slowly decaying correlations will slow down learning even
though the system $\mathcal{D}$ may have a fast decay of correlations.
Let us now discuss some examples of classes of dynamical systems enjoying
at least a polynomial decay of correlations. Since the existing literature on
such systems is vast, these examples are only meant to illustrate
situations where Theorem 3.1 can be applied and are not intended to provide
an overview of known results. However, compilations of known results can
be found in the survey articles [3, 28] and the book [2].

Example 3.2 (Smooth expanding dynamics). Let $M$ be a compact connected
Riemannian manifold and $F : M \to M$ be $C^{1+\epsilon}$ for some $\epsilon > 0$.
Furthermore assume that there exist constants $c > 0$ and $\lambda > 1$ such that

$\max\{\|DF^n_x(v)\| : x \in M, v \in T_xM \text{ with } \|v\| = 1\} \ge c\lambda^n$

for all $n \ge 0$, where $T_xM$ denotes the tangent space of $M$ at $x$ and $DF^n_x$
denotes the derivative of $F^n$ at $x$. Then it is a classical result that $F$
possesses a unique ergodic measure which is absolutely continuous with respect
to the Riemannian volume. Moreover, it is well known (see, e.g., [33] and the
references mentioned in [28], Theorem 5) that there exists a $\tau > 0$ such that
the dynamical system has decay of correlations of the order $(e^{-\tau i})$.
Generalizations of this result to piecewise smooth and piecewise (non)-uniformly
expanding dynamics are discussed in [3]. Finally, [28], Theorem 10, recalls
results (together with references) for non-uniformly expanding dynamics
having either exponential or polynomial decay of correlations.

Example 3.3 (Smooth hyperbolic dynamics). If $F$ is a topologically mixing
$C^{1+\epsilon}$ Anosov or Axiom A diffeomorphism, then it is well known (see,
e.g., [9, 34]) that there exists a $\tau > 0$ such that the dynamical system has
a decay of correlations of the order $(e^{-\tau i})$. Moreover, Baladi [3] lists various
extensions of this result to, for example, smooth nonuniformly hyperbolic
systems and hyperbolic systems with singularities.

Besides these classical results and their extensions, Baladi [3] also compiles
a list of "parabolic" or "intermittent" systems having a polynomial decay.
Let us now consider the forecasting problem for the least squares loss.
To this end we first observe that the function $L(r) := \|r\|_2^2$, $r \in \mathbb{R}^d$, satisfies
Assumption LD since the least squares loss satisfies Assumption L as we have
discussed in Example 1.3. Let us now additionally assume that the noise is
pairwise independent (i.e., $\varepsilon_i$ and $\varepsilon_{i'}$ are independent if $i \ne i'$) and centered
[i.e., it satisfies $\mathbb{E}_{\varepsilon\sim\nu}\, \pi_0(\varepsilon) = 0$]. For a forecaster $f = (f_1, \dots, f_d) : \mathbb{R}^d \to \mathbb{R}^d$ we
then obtain

$\mathcal{R}_{L,P}(f) = \iint \sum_{j=1}^{d} \bigl(\pi_j(F(x)) + \pi_j(\varepsilon_1) - f_j(x + \varepsilon_0)\bigr)^2\, \nu(d\varepsilon)\, \mu(dx)$

$= \iint \sum_{j=1}^{d} \bigl(\pi_j(F(x)) - f_j(x + \varepsilon_0)\bigr)^2\, \nu(d\varepsilon)\, \mu(dx) + \int \|\varepsilon_0\|_2^2\, \nu(d\varepsilon)$

$=: \bar{\mathcal{R}}_{L,P}(f) + \int \|\varepsilon_0\|_2^2\, \nu(d\varepsilon)$,

where $\pi_j : \mathbb{R}^d \to \mathbb{R}$ denotes the $j$th coordinate projection. Consequently, a
forecaster $f$ that approximately minimizes the $L$-risk is also an approximate
forecaster of the true next state in the sense of $\bar{\mathcal{R}}_{L,P}(\cdot)$. Before we combine
this observation with Theorem 3.1 let us first rephrase Assumptions S1, S2
and S3 for the least squares loss.
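
The reason why the least squares risk splits into a risk for the true next state plus a noise term can be made explicit by expanding the square coordinate by coordinate. The following worked step is added for convenience; it assumes that $f_j(x + \varepsilon_0)$ is integrable and uses only the independence of $\varepsilon_0$ and $\varepsilon_1$, the centering of the noise, and its stationarity.

```latex
\begin{align*}
\bigl(\pi_j(F(x)) + \pi_j(\varepsilon_1) - f_j(x+\varepsilon_0)\bigr)^2
  &= \bigl(\pi_j(F(x)) - f_j(x+\varepsilon_0)\bigr)^2
   + 2\,\pi_j(\varepsilon_1)\bigl(\pi_j(F(x)) - f_j(x+\varepsilon_0)\bigr)
   + \pi_j(\varepsilon_1)^2, \\[4pt]
\int \pi_j(\varepsilon_1)\bigl(\pi_j(F(x)) - f_j(x+\varepsilon_0)\bigr)\,\nu(d\varepsilon)
  &= \int \pi_j(\varepsilon_1)\,\nu(d\varepsilon)\cdot
     \int \bigl(\pi_j(F(x)) - f_j(x+\varepsilon_0)\bigr)\,\nu(d\varepsilon) \;=\; 0, \\[4pt]
\sum_{j=1}^{d} \int \pi_j(\varepsilon_1)^2\,\nu(d\varepsilon)
  &= \int \|\varepsilon_1\|_2^2\,\nu(d\varepsilon)
   \;=\; \int \|\varepsilon_0\|_2^2\,\nu(d\varepsilon).
\end{align*}
```

The cross term vanishes because $\varepsilon_1$ is independent of $\varepsilon_0$ and has mean zero, and the last identity uses stationarity of the noise.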

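Assumption S1-LS. For a strictly positive sequence $(\gamma_i)_{i\ge 0}$ converging
to 0 the monotone sequences $(\lambda_n) \subset (0, 1]$ and $(\sigma_n) \subset [1, \infty)$ satisfy
$\lim_{n\to\infty} \lambda_n = 0$, $\sup_{n\ge 1} e^{-\sigma_n} \lambda_n^{-1/2} < \infty$,

$\sup_{n\ge 1} \lambda_n^{3/2} \sigma_n^{4d} < \infty$    and    $\lim_{n\to\infty} \frac{\sigma_n^2}{n \lambda_n^{11/2}} \sum_{i=0}^{n-1} \gamma_i = 0$.

Assumption S2-LS. For a strictly positive sequence $(\gamma_i)_{i\ge 0}$ converging
to 0 the sequences $(\lambda_n) \subset (0, 1]$ and $(\sigma_n) \subset [1, \infty)$ satisfy $\lim_{n\to\infty} \lambda_n \sigma_n^d = 0$,

$\lim_{n\to\infty} \lambda_n^{3/2} \sigma_n^{4d} = \infty$    and    $\lim_{n\to\infty} \frac{\sigma_n^{2+12d}}{n \lambda_n} \sum_{i=0}^{n-1} \gamma_i = 0$.

Assumption S3-LS. For a strictly positive sequence $(\gamma_i)_{i\ge 0}$ converging
to 0 the monotone sequences $(\lambda_n) \subset (0, 1]$ and $(\sigma_n) \subset [1, \infty)$ satisfy
$\lim_{n\to\infty} \lambda_n = 0$, $\lim_{n\to\infty} e^{-\sigma_n} \lambda_n^{-1/2} = \infty$,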



n-l 



sup 



pXnaff^^<oo and lim — — "S^ 7^ = 0. 
n>i nXl ^ 

Remark 3.4. In order to illustrate the Assumptions S1-LS, S2-LS and
S3-LS, let us first assume $\sum_{i\ge 0}\gamma_i < \infty$ as well as $\lambda_n := n^{-\alpha}$ and $\sigma_n := n^{\beta}$
for $n \ge 1$ and constants $\alpha > 0$ and $\beta \ge 0$. Then Assumption S1-LS is met
if $3\alpha > 8d\beta > 0$ and $11\alpha + 4\beta < 2$, whereas Assumption S2-LS is met if
a + (2 + 12d)(3 < 1 and dj3 <a < ^df3. Finally Assumption S3-LS is satisfied 
if $\beta = 0$ and $0 < \alpha < 1/7$.

Let us now consider the milder assumption $n^{-1}\sum_{i=0}^{n-1}\gamma_i \le c(1+\ln n)^{-1}$
which has already been considered in Remark 2.3 for Lipschitz continuous
losses. To this end, we again consider the sequences $\lambda_n := (1+\ln n)^{-\alpha}$ and
$\sigma_n := (1+\ln n)^{\beta}$ for $n \ge 1$ and constants $\alpha > 0$ and $\beta \ge 0$. Then Assumption
S1-LS is met if $3\alpha > 8d\beta > 0$ and $11\alpha + 4\beta < 2$, whereas Assumption S2-LS
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is met if a + (2 + l2d)P < 1 and d(3 <a< ^d(3. Finally Assumption S3-LS 
is satisfied if $\beta = 0$ and $0 < \alpha < 1/7$.

With these preparations we can now state a result showing that SVMs 
using a least squares loss can be used to forecast the next true state of the 
dynamical system if the observational noise is sufficiently benign. 

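Corollary 3.5. Let $M \subset \mathbb{R}^d$ be a compact set and $F : M \to M$ be a
Lipschitz continuous map such that the dynamical system $\mathcal{D} := (F^i)_{i\ge 0}$ has
a unique ergodic measure $\mu$. Moreover, let $\mathcal{E} = (\varepsilon_i)_{i\ge 0}$ be an i.i.d. process of
$[-B,B]^d$-valued and centered random variables. Assume that $\mathcal{D}$ has a decay
of correlations of the order $(\gamma_i)$. Moreover, let $L : \mathbb{R}^d \to [0,\infty)$ be defined by
$L(r) := \|r\|_2^2$, $r \in \mathbb{R}^d$. Then for all sequences $(\lambda_n) \subset (0,1]$ and $(\sigma_n) \subset [1,\infty)$
satisfying Assumptions S1-LS, S2-LS or S3-LS and all $\varepsilon \in (0,1]$ we have

$$\lim_{n\to\infty} \mu\otimes\nu\bigl((x,\epsilon) \in M \times ([-B,B]^d)^{\mathbb{N}_0} : |\mathcal{R}_{L,P}(f_{T_n(x,\epsilon),\lambda_n,\sigma_n}) - \mathcal{R}^*_{L,P}| > \varepsilon\bigr) = 0,$$

where $\mathcal{R}^*_{L,P} := \inf\{\mathcal{R}_{L,P}(f) \mid f : \mathbb{R}^d \to \mathbb{R}^d \text{ measurable}\}$.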

It is interesting to note that the above corollary does not require the 
noise to be symmetric. Instead it only requires centered noise, that is, the 
observations are not systematically biased in a certain direction. 

Let us end this section with the following remark that rephrases Theorem 
3.1 and its corollary for situations with summable decays of correlations. 

Remark 3.6 (Universal consistency). If the sequence $(\gamma_i)$ bounding the
correlation is summable, that is, $\sum_i \gamma_i < \infty$, then the Assumptions S1, S2, S3,
S1-LS, S2-LS and S3-LS are independent of both the dynamical system and
the observational noise process. Consequently, using sequences satisfying one 
of these assumptions yields an SVM which is consistent for all such pairs 
of dynamical systems and observational noise processes. In other words, 
such an SVM can learn the optimal forecaster without knowing specifics 
of the dynamical systems and the observational noise. To be a bit more 
specific, let us assume, for example, that we use the least squares loss and 
sequences $\lambda_n := n^{-\alpha}$ and $\sigma_n := n^{\beta}$, $n \ge 1$, for fixed $\alpha$ and $\beta$ satisfying $3\alpha >
8d\beta > 0$ and $11\alpha + 4\beta < 2$. Then the corresponding SVM is consistent for
all bounded observational noise processes having a summable $\alpha$-mixing rate
and all ergodic dynamical systems on $M$ which are defined by a Lipschitz
continuous $F : M \to M$ and have a summable decay of correlations. Note
that this class of dynamical systems includes, but is not limited to, smooth 
uniformly expanding or hyperbolic dynamics. Finally, if the noise process is 
also i.i.d. and centered then this SVM actually learns to forecast the next 
true state. 
It is interesting to note that a similar consistency result holds for all
noise processes having a polynomial decay of $\alpha$-mixing coefficients and all
ergodic dynamical systems on $M$ which are defined by a Lipschitz continuous
$F : M \to M$ and have a polynomial decay of correlations. Indeed, for such
combinations SVMs using sequences $\lambda_n := (1+\ln n)^{-\alpha}$ and $\sigma_n := (1+\ln n)^{\beta}$
with, for example, fixed $\alpha$ and $\beta$ satisfying $3\alpha > 8d\beta > 0$ and $11\alpha + 4\beta < 2$
are consistent.
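
As a small illustration of how the polynomial schedules of Remark 3.6 might be set up in practice, the following sketch checks the two sufficient inequalities quoted above for S1-LS and produces $(\lambda_n, \sigma_n)$. The function names are hypothetical and the inequalities are copied from the remark as stated; this is not a verification of the assumptions themselves.

```python
def satisfies_s1_ls(alpha, beta, d):
    """Check the sufficient conditions 3*alpha > 8*d*beta > 0 and 11*alpha + 4*beta < 2."""
    return 3 * alpha > 8 * d * beta > 0 and 11 * alpha + 4 * beta < 2

def schedule(n, alpha, beta):
    """Return (lambda_n, sigma_n) = (n**(-alpha), n**beta) for a sample size n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n ** (-alpha), n ** beta

# Hypothetical choice for input dimension d = 2.
alpha, beta, d = 0.15, 0.01, 2
if satisfies_s1_ls(alpha, beta, d):
    for n in (10, 1000, 100000):
        print(n, schedule(n, alpha, beta))
```

Note that $\lambda_n \in (0,1]$ and $\sigma_n \ge 1$ hold automatically for these schedules when $\alpha > 0$ and $\beta \ge 0$.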

4. Discussion. The goal of this work was to show that, in principle, sup- 
port vector machines can learn how to predict one-step-ahead noisy obser- 
vation of a dynamical system without knowing specifics of the dynamical 
system or the observational noise besides a certain, rather general stochas- 
ticity. However, there remain several open questions which can be subject 
to further research: 

More general losses and kernels. In the statistical part of our analysis, 
we used an approach which is based on a "stability" argument. However, 
it is also possible to use a "skeleton" argument based on covering numbers, 
instead. Utilizing the latter, it seems possible to relax the assumptions on the 
loss $L$ by making stronger assumptions on both $(\lambda_n)$ and $(\sigma_n)$. A particular
loss which is interesting in this direction would be the $\epsilon$-insensitive loss used
in classical SVMs for regression. Another possible extension of our work 
is considering different kernels, such as the kernels that generate Sobolev 
spaces. In fact, we only focused on Gaussian RBF kernels since these kernels 
are the most commonly used in practice. 

Learning rates. So far we have only shown that the risk of the SVM 
solution converges to the smallest possible risk. However, for practical con- 
siderations the speed of this convergence is of great importance, too. The 
proof we utilized already gives such learning rates if a quantitative version of 
the Approximation Lemma 5.4 is available, which is possible if, for example, 
quantitative assumptions on the smoothness of $F$ and the regularity of $\nu$ are
made. However, since we conjecture that the statistical part of our analysis 
is not sharp we have not presented a corresponding result. In this regard we 
note that recently [14] established a concentration result for piecewise regu- 
lar expanding and topologically mixing maps of the interval [0, 1], which is 
substantially stronger than our elementary Chebyshev inequality of Lemma 
5.8. We believe that such a concentration result can be used to substantially 
sharpen the statistical part of our analysis. 

Perturbed dynamics. Another extension of the current work is to consider 
systems that are perturbed by some noise. Our general consistency result 
in Theorem 2.4 suggests that such an extension is possible whenever the 
perturbed system has a decay of correlations. In this regard we note that 
for some perturbed systems of expanding maps the decay of correlations 
has already been bounded in [5], and it would be interesting to investigate
whether they can be used to prove consistency of SVMs. 

Longer past. So far, we have only used the present observation to forecast
the next observation, but it is not hard to see that in almost any
system/noise combination the minimal risk $\mathcal{R}^*_{L,P}$ reduces if one uses additional
past observations. On the other hand it appears that the learning problem 
becomes harder in this case since we have to approximate a function which 
lives on a higher dimensional input space, and hence there seems to be a 
trade-off for finite sample sizes. While investigating this trade-off in more 
detail seems to be possible with the techniques developed in this work, we 
again assume that the statistical part of our analysis is not sharp enough to 
obtain a meaningful picture of this trade-off. 

5. Proof of Theorem 2.4. The goal of this section is to prove Theorem 
2.4. Since the proof requires several preliminary results, we divided this 
section into subsections, which provide these prerequisites. 

5.1. Some basics on the decay of correlations. The main goal of this 
section is to establish some uniform bounds on the sequence of correlations. 

Let us begin by introducing some notation. To this end, we fix a probability
space $(\Omega,\mathcal{A},\mu)$, a measurable space $(Z,\mathcal{B})$ and a $Z$-valued, identically
distributed process $\mathcal{Z}$ on $\Omega$. For $P := \mu_{Z_0}$ and $\psi,\varphi \in L_2(P)$ we then write
$\mathrm{cor}_{\mathcal{Z}}(\psi,\varphi) := (\mathrm{cor}_{\mathcal{Z},n}(\psi,\varphi))_{n\ge 0}$ for the sequence of correlations of $\psi$ and
$\varphi$. Clearly, this gives a bilinear map $\mathrm{cor}_{\mathcal{Z}} : L_2(P) \times L_2(P) \to \ell_\infty$, which in
the following is called the correlation operator. The following key theorem,
which goes back to an unpublished note [13] of Collet (see also page 101 in
[4]), can be used to establish continuity of the correlation operator. Before
we present this result let us first recall that a Banach space $E$ is said to be
continuously embedded into the Banach space $F$ if $E \subset F$ and the natural
inclusion map $\mathrm{id} : E \to F$ is continuous.

Theorem 5.1. Let $(\Omega,\mathcal{A},\mu)$ be a probability space, $(Z,\mathcal{B})$ be a measurable
space, $\mathcal{Z}$ be a $Z$-valued, identically distributed process on $\Omega$ and
$P := \mu_{Z_0}$. Moreover, let $E_1$ and $E_2$ be Banach spaces that are continuously
embedded into $L_2(P)$ and let $F$ be a Banach space that is continuously embedded
into $\ell_\infty$. If for all $\psi \in E_1$ and all $\varphi \in E_2$ the correlation operator
satisfies

$$\mathrm{cor}_{\mathcal{Z}}(\psi,\varphi) \in F,$$

then there exists a constant $c \in [0,\infty)$ such that

$$\|\mathrm{cor}_{\mathcal{Z}}(\psi,\varphi)\|_F \le c \cdot \|\psi\|_{E_1}\|\varphi\|_{E_2}, \qquad \psi \in E_1,\ \varphi \in E_2.$$
For the sake of completeness the proof of this key result can be found in
the Appendix. The most obvious examples of Banach spaces $F$ in the above
theorem are the spaces $\ell_p$. However, in the literature on dynamical systems
results on the sequence of correlations are usually stated in the form

$$|\mathrm{cor}_{\mathcal{Z},n}(\psi,\varphi)| \le \kappa_{\psi,\varphi}\,\gamma_n, \qquad n \ge 0,$$

where $(\gamma_n)$ is a strictly positive sequence converging to 0 and $\kappa_{\psi,\varphi}$ is a constant
depending on $\psi$ and $\varphi$. To apply Theorem 5.1 in this situation we
obviously need Banach spaces which capture such a behavior of $\mathrm{cor}_{\mathcal{Z}}(\cdot,\cdot)$.
Therefore, let us fix a strictly positive sequence $\gamma := (\gamma_n)_{n\ge 0}$ such that
$\lim_{n\to\infty}\gamma_n = 0$. For a sequence $b := (b_n) \subset \mathbb{R}$ we define

$$\|b\|_{\Lambda(\gamma)} := \sup_{n\ge 0} \frac{|b_n|}{\gamma_n}.$$

Moreover, we write

$$\Lambda(\gamma) := \{(b_n) \subset \mathbb{R} : \|(b_n)\|_{\Lambda(\gamma)} < \infty\}.$$

The following lemma establishes some basic properties of $(\Lambda(\gamma), \|\cdot\|_{\Lambda(\gamma)})$.

Lemma 5.2. The pair $(\Lambda(\gamma), \|\cdot\|_{\Lambda(\gamma)})$ is a Banach space continuously
embedded into $\ell_\infty$ and we have $\|\mathrm{id} : \Lambda(\gamma) \to \ell_\infty\| \le \|\gamma\|_\infty$.

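Proof. The fact that $(\Lambda(\gamma), \|\cdot\|_{\Lambda(\gamma)})$ is a normed space is elementary
to prove. Moreover, we have

$$\|b\|_{\Lambda(\gamma)} = \sup_{n\ge 0}\frac{|b_n|}{\gamma_n} \ge \sup_{n\ge 0}\frac{|b_n|}{\|\gamma\|_\infty} = \frac{\|b\|_\infty}{\|\gamma\|_\infty},$$

and hence we find $\|\mathrm{id} : \Lambda(\gamma) \to \ell_\infty\| \le \|\gamma\|_\infty$. Finally, let $(b^{(i)})_{i\ge 1}$ be a Cauchy
sequence in $\Lambda(\gamma)$. The previous step shows that it is also a Cauchy sequence
in $\ell_\infty$, and by the completeness of $\ell_\infty$ there consequently exists a sequence
$b := (b_n) \in \ell_\infty$ such that $\lim_{j\to\infty}\|b^{(j)} - b\|_\infty = 0$. Let us now fix an $\varepsilon > 0$.
Then there exists an index $i_0 \ge 0$ such that for all $i,j \ge i_0$ we have
$\|b^{(i)} - b^{(j)}\|_{\Lambda(\gamma)} \le \varepsilon$. Consequently, for fixed $N \ge 0$ we have

$$\sup_{n=0,\dots,N}\frac{|b^{(i)}_n - b^{(j)}_n|}{\gamma_n} \le \|b^{(i)} - b^{(j)}\|_{\Lambda(\gamma)} \le \varepsilon,$$

and by taking the limit $j \to \infty$ we conclude

$$\sup_{n=0,\dots,N}\frac{|b^{(i)}_n - b_n|}{\gamma_n} \le \varepsilon.$$

However, $N$ was arbitrary and hence we find $\|b^{(i)} - b\|_{\Lambda(\gamma)} \le \varepsilon$ for all $i \ge i_0$.
In other words we have shown that $(b^{(i)})_{i\ge 1}$ converges to $b$ in $\|\cdot\|_{\Lambda(\gamma)}$. □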
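
To make the weighted sup-norm of $\Lambda(\gamma)$ concrete, here is a minimal sketch on finite truncations of sequences. The truncation and the particular choices of `gamma` and `b` are assumptions made for illustration only. It computes $\|b\|_{\Lambda(\gamma)}$ and checks the embedding inequality $\|b\|_\infty \le \|\gamma\|_\infty \|b\|_{\Lambda(\gamma)}$ from Lemma 5.2.

```python
def sup_norm(seq):
    """Sup-norm of a finite (truncated) sequence."""
    return max(abs(x) for x in seq)

def lambda_norm(b, gamma):
    """Weighted sup-norm sup_n |b_n| / gamma_n for a strictly positive gamma."""
    if len(b) != len(gamma) or any(g <= 0 for g in gamma):
        raise ValueError("gamma must be strictly positive and as long as b")
    return max(abs(bn) / gn for bn, gn in zip(b, gamma))

# Hypothetical example: gamma_n = (n + 1)^(-2) and a sequence decaying at the same rate.
gamma = [1.0 / (n + 1) ** 2 for n in range(50)]
b = [3.0 * (-1) ** n / (n + 1) ** 2 for n in range(50)]

print(lambda_norm(b, gamma))                       # size of b relative to gamma
print(sup_norm(b), sup_norm(gamma) * lambda_norm(b, gamma))
```

A finite value of `lambda_norm` is exactly the statement that $|b_n| \le \kappa\,\gamma_n$ for some constant $\kappa$, which is the form in which decay of correlations is usually reported.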
Combining the above lemma with Theorem 5.1 we immediately obtain
the following corollary:

Corollary 5.3. Let $(\Omega,\mathcal{A},\mu)$ be a probability space, $(Z,\mathcal{B})$ be a measurable
space, $\mathcal{Z}$ be a $Z$-valued, identically distributed process on $\Omega$ and
$P := \mu_{Z_0}$. Moreover, let $E_1$ and $E_2$ be Banach spaces that are continuously
embedded into $L_2(P)$. In addition, let $\gamma := (\gamma_n)_{n\ge 0}$ be a strictly positive sequence
such that $\lim_{n\to\infty}\gamma_n = 0$. If for all $\psi \in E_1$ and all $\varphi \in E_2$ there exists
a constant $\kappa_{\psi,\varphi} \in [0,\infty)$ such that

$$|\mathrm{cor}_{\mathcal{Z},n}(\psi,\varphi)| \le \kappa_{\psi,\varphi}\,\gamma_n$$

for all $n \ge 0$, then there exists a constant $c \in [0,\infty)$ such that

$$|\mathrm{cor}_{\mathcal{Z},n}(\psi,\varphi)| \le c\,\|\psi\|_{E_1} \cdot \|\varphi\|_{E_2} \cdot \gamma_n, \qquad \psi \in E_1,\ \varphi \in E_2,\ n \ge 0.$$

5.2. Some properties of Gaussian RBF kernels. In this subsection we 
establish some properties of Gaussian RBF kernels which will be heavily 
used in the proof of Theorem 2.4. Let us begin with an approximation result. 

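Lemma 5.4. Let $X \subset \mathbb{R}^d$ and $Y \subset \mathbb{R}$ be compact subsets, $L : X \times Y \times \mathbb{R} \to
[0,\infty)$ be a convex locally Lipschitz continuous loss and $P$ be a probability
measure on $X \times Y$ such that $\mathcal{R}_{L,P}(0) < \infty$. Then for all sequences $(\lambda_n) \subset
(0,1]$ and $(\sigma_n) \subset [1,\infty)$ satisfying

(11) $\lim_{n\to\infty} \lambda_n\sigma_n^d = 0$,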

we have 

Proof. For $\sigma > 0$ we write $\mathcal{R}^*_{L,P,H_\sigma(X)} := \inf\{\mathcal{R}_{L,P}(f) : f \in H_\sigma(X)\}$.

Since $L$ is locally Lipschitz continuous and $\mathcal{R}_{L,P}(0) < \infty$, the discussion
after (4) in [38] shows that it is a $P$-integrable Nemitski loss in the sense
of [38]. Now recall (see [37]) that $H_\sigma(X)$ is universal, that is, it is dense in
$C(X)$, and hence [38], Corollary 1, shows $\mathcal{R}^*_{L,P,H_\sigma(X)} = \mathcal{R}^*_{L,P}$ for all $\sigma > 0$.

Let us now fix an $\varepsilon > 0$. The above discussion then shows that there exists
an $f_\varepsilon \in H_1(X)$ such that $\mathcal{R}_{L,P}(f_\varepsilon) \le \mathcal{R}^*_{L,P} + \varepsilon$. Furthermore, by (11) there
exists an $n_0 \ge 0$ such that

$$\lambda_n\sigma_n^d \le \varepsilon\,\|f_\varepsilon\|^{-2}_{H_1(X)}, \qquad n \ge n_0.$$

Since $\sigma_n \ge 1$ we also know $f_\varepsilon \in H_{\sigma_n}(X)$ and $\|f_\varepsilon\|_{H_{\sigma_n}(X)} \le \sigma_n^{d/2}\|f_\varepsilon\|_{H_1(X)}$ by
[40], Corollary 6, and therefore we obtain
for all $n \ge n_0$. From this we easily deduce the assertion. □

Before we establish the next result let us recall that a function $f : X \to \mathbb{R}$
on a subset $X \subset \mathbb{R}^d$ is called Lipschitz continuous if there exists a constant
$c \in [0,\infty)$ such that $|f(x) - f(x')| \le c\|x - x'\|_2$ for all $x,x' \in X$. In the following
the smallest such constant is denoted by $|f|_1$ and the set of all Lipschitz
continuous functions is denoted by $\mathrm{Lip}(X)$. Moreover, recall that if $X$ is
compact then $\mathrm{Lip}(X)$ together with the norm $\|f\|_{\mathrm{Lip}(X)} := \max\{\|f\|_\infty, |f|_1\}$
forms a Banach space. In this case $\mathrm{Lip}(X)$ is also closed under multiplication.
Indeed, for $f,g \in \mathrm{Lip}(X)$ and $x,x' \in X$ we have

$$|f(x)g(x) - f(x')g(x')| \le \|f\|_\infty \cdot |g|_1\,\|x - x'\|_2 + |f|_1 \cdot \|g\|_\infty\,\|x - x'\|_2$$

and hence we obtain $fg \in \mathrm{Lip}(X)$ with $|fg|_1 \le \|f\|_\infty \cdot |g|_1 + |f|_1 \cdot \|g\|_\infty$. Our
next result shows that every function in $H_\sigma(X)$ is Lipschitz continuous.

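Lemma 5.5. Let $X \subset \mathbb{R}^d$ be a nonempty set and $\sigma > 0$. Then every
$f \in H_\sigma(X)$ is Lipschitz continuous with $|f|_1 \le \sqrt{2}\,\sigma\,\|f\|_{H_\sigma(X)}$.

Proof. Let us write $\Phi : X \to H_\sigma(X)$ for the canonical feature map defined
by $\Phi(x) := k_\sigma(x,\cdot)$. Now recall that $\Phi$ satisfies the reproducing property

$$f(x) = \langle \Phi(x), f \rangle, \qquad x \in X,\ f \in H_\sigma(X),$$

and hence in particular $k_\sigma(x',x) = \langle \Phi(x), \Phi(x') \rangle$ for all $x,x' \in X$. Using these
equalities together with $1 - e^{-t} \le t$ for $t \ge 0$ we obtain

$$|f(x) - f(x')| = |\langle \Phi(x) - \Phi(x'), f \rangle|$$
$$\le \|f\|_{H_\sigma(X)} \cdot \|\Phi(x) - \Phi(x')\|_{H_\sigma(X)}$$
$$= \|f\|_{H_\sigma(X)} \sqrt{\langle \Phi(x),\Phi(x) \rangle + \langle \Phi(x'),\Phi(x') \rangle - 2\langle \Phi(x),\Phi(x') \rangle}$$
$$= \|f\|_{H_\sigma(X)} \sqrt{2 - 2\exp(-\sigma^2\|x - x'\|_2^2)}$$
$$\le \sqrt{2}\,\sigma\,\|f\|_{H_\sigma(X)}\,\|x - x'\|_2,$$

that is, we have proved the assertion. □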
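
The bound of Lemma 5.5 can be checked numerically for a finite kernel expansion. The sketch below works in one dimension with the kernel $k_\sigma(x,x') = \exp(-\sigma^2 (x-x')^2)$ used in the proof; the centers, coefficients, grid and the value of `sigma` are hypothetical choices made for illustration. For $f = \sum_i a_i k_\sigma(x_i,\cdot)$ the RKHS norm is $\sqrt{a^\top K a}$ with Gram matrix $K$, and the largest difference quotient on a grid should not exceed $\sqrt{2}\,\sigma\,\|f\|_{H_\sigma}$.

```python
import math

def gauss_kernel(sigma, x, y):
    """Gaussian RBF kernel k_sigma(x, y) = exp(-sigma^2 (x - y)^2) in one dimension."""
    return math.exp(-sigma ** 2 * (x - y) ** 2)

def rkhs_norm(sigma, centers, coeffs):
    """Norm of f = sum_i a_i k_sigma(x_i, .), computed as sqrt(a^T K a)."""
    total = sum(a * b * gauss_kernel(sigma, x, y)
                for a, x in zip(coeffs, centers)
                for b, y in zip(coeffs, centers))
    return math.sqrt(total)

def evaluate(sigma, centers, coeffs, t):
    """Value f(t) of the kernel expansion."""
    return sum(a * gauss_kernel(sigma, x, t) for a, x in zip(coeffs, centers))

def max_grid_slope(sigma, centers, coeffs, lo=-2.0, hi=2.0, steps=2000):
    """Largest difference quotient of f between neighbouring grid points."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    vals = [evaluate(sigma, centers, coeffs, t) for t in grid]
    return max(abs(v1 - v0) / (t1 - t0)
               for t0, t1, v0, v1 in zip(grid, grid[1:], vals, vals[1:]))

sigma, centers, coeffs = 2.0, [-0.5, 0.1, 0.7], [1.0, -2.0, 0.5]
slope = max_grid_slope(sigma, centers, coeffs)
bound = math.sqrt(2) * sigma * rkhs_norm(sigma, centers, coeffs)
print(slope, bound)
```

The grid slope only gives a lower estimate of the true Lipschitz constant, so this is a sanity check of the inequality rather than a measurement of its sharpness.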

In the following we consider certain orthonormal bases (ONBs) of $H_\sigma(X)$.
To this end, let us first recall that in [40], Theorem 5, it was shown that
$(e_n)_{n\ge 0}$, where $e_n : \mathbb{R} \to \mathbb{R}$ is defined by

(12) $e_n(x) := \sqrt{\dfrac{(2\sigma^2)^n}{n!}}\; x^n e^{-\sigma^2 x^2}, \qquad x \in \mathbb{R},$

forms an ONB of $H_\sigma(\mathbb{R})$. Moreover, it was shown that if $X \subset \mathbb{R}$ has a
nonempty interior, the restrictions of $e_n$ to $X$ form an ONB of $H_\sigma(X)$.






The following lemma establishes upper bounds on $\|e_n\|_\infty$ if $X$ is a closed
interval.

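Lemma 5.6. Let $\sigma > 0$ and $a > 0$ be fixed real numbers and $(e_n)_{n\ge 0}$ be
the ONB of $H_\sigma([-a,a])$, where $e_n$ is defined by the restriction of (12) to
$[-a,a]$. Then we have $\|e_n\|_\infty \le (2\pi n)^{-1/4}$ for all $n \ge 1$ and

(13) $\|e_n\|_\infty \le \sqrt{\dfrac{(2a^2\sigma^2)^n}{n!}}\; e^{-a^2\sigma^2}$

for all $n \ge 2a^2\sigma^2$. In addition, for $n \ge 8ea^2\sigma^2$ we have

(14) $\Bigl(\sum_{i=n+1}^{\infty}\|e_i\|_\infty^2\Bigr)^{1/2} \le \Bigl(\dfrac{2}{\pi(n+1)}\Bigr)^{1/4} 2^{-(n+1)}\, e^{-a^2\sigma^2},$

and for $a\sigma \ge 1$ we also have

(15) $\Bigl(\sum_{i=0}^{\infty}\|e_i\|_\infty^2\Bigr)^{1/2} \le \sqrt{6a\sigma}.$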

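Proof. Elementary calculus shows

$$e_n'(x) = \sqrt{\frac{(2\sigma^2)^n}{n!}}\; x^{n-1} e^{-\sigma^2 x^2}\,(n - 2\sigma^2 x^2)$$

for all $n \ge 1$ and $x \in \mathbb{R}$. From this we conclude $e_n'(x^*) = 0$ if and only if
$x^* = \pm\sqrt{n/(2\sigma^2)}$ or $x^* = 0$. Therefore it is not hard to see that the function
$|e_n|$ defined in (12) attains its global extrema at $x^* = \pm\sqrt{n/(2\sigma^2)}$, and hence we
obtain

$$\|e_n\|_\infty \le \sqrt{\frac{(2\sigma^2)^n}{n!}}\Bigl(\frac{n}{2\sigma^2}\Bigr)^{n/2} e^{-n/2} = \sqrt{\frac{n^n e^{-n}}{n!}} \le (2\pi n)^{-1/4}$$

for all $n \ge 1$ by Stirling's formula. Moreover, $n \ge 2a^2\sigma^2$ implies $|x^*| \ge a$ and,
in this case, it is not hard to see that the function $|e_n|$ actually attains its
maximum at $\pm a$. From these considerations we conclude (13).

For the proof of (14) we recall that the remainder of the Taylor series of
the exponential function satisfies

$$\Bigl|e^y - \sum_{i=0}^{n}\frac{y^i}{i!}\Bigr| \le \frac{2|y|^{n+1}}{(n+1)!}$$

for $|y| \le 1 + n/2$. Since $n \ge 8ea^2\sigma^2$ implies $2a^2\sigma^2 \le 1 + n/2$, we consequently
obtain

$$\sum_{i=n+1}^{\infty}\|e_i\|_\infty^2 \le e^{-2a^2\sigma^2}\sum_{i=n+1}^{\infty}\frac{(2a^2\sigma^2)^i}{i!} \le \frac{2\,(2a^2\sigma^2)^{n+1}}{(n+1)!}\,e^{-2a^2\sigma^2}$$
$$\le 2\,\frac{2^{n+1}a^{2(n+1)}\sigma^{2(n+1)}e^{(n+1)}}{\sqrt{2\pi(n+1)}\,(n+1)^{(n+1)}}\,e^{-2a^2\sigma^2}$$
$$\le \Bigl(\frac{2}{\pi(n+1)}\Bigr)^{1/2} 4^{-(n+1)}\, e^{-2a^2\sigma^2}.$$

From this we easily deduce (14). Finally, for the proof of (15), we observe

$$\sum_{i=0}^{\lceil 8ea^2\sigma^2\rceil}\|e_i\|_\infty^2 \le 1 + (2\pi)^{-1/2} + \sum_{i=2}^{\lceil 8ea^2\sigma^2\rceil}(2\pi i)^{-1/2}$$
$$\le 1 + (2\pi)^{-1/2} + (2\pi)^{-1/2}\int_1^{\lceil 8ea^2\sigma^2\rceil} t^{-1/2}\,dt$$
$$\le 1 + (2\pi)^{-1/2} + (e/\pi)^{1/2}\,4a\sigma$$
$$\le 3/2 + 4a\sigma.$$

Combining this estimate with (14), we then obtain

$$\sum_{i=0}^{\infty}\|e_i\|_\infty^2 = \sum_{i=0}^{\lceil 8ea^2\sigma^2\rceil}\|e_i\|_\infty^2 + \sum_{i=\lceil 8ea^2\sigma^2\rceil+1}^{\infty}\|e_i\|_\infty^2$$
$$\le 3/2 + 4a\sigma + \Bigl(\frac{2}{\pi(\lceil 8ea^2\sigma^2\rceil+1)}\Bigr)^{1/2} 4^{-(\lceil 8ea^2\sigma^2\rceil+1)}\, e^{-2a^2\sigma^2}$$
$$\le 2 + 4a\sigma,$$

and from the latter we easily obtain (15), since $a\sigma \ge 1$ gives $2 + 4a\sigma \le 6a\sigma$. □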
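
The basis functions of (12) are easy to evaluate, so the first bound of Lemma 5.6 and the expansion of the kernel can be sanity-checked numerically. The sketch below assumes the form $e_n(x) = \sqrt{(2\sigma^2)^n/n!}\,x^n e^{-\sigma^2 x^2}$ and the kernel $k_\sigma(x,x') = \exp(-\sigma^2(x-x')^2)$; the grid, the truncation level and the parameter values are illustrative assumptions.

```python
import math

def e_n(n, sigma, x):
    """ONB element e_n(x) = sqrt((2 sigma^2)^n / n!) * x^n * exp(-sigma^2 x^2)."""
    scale = math.sqrt((2 * sigma ** 2) ** n / math.factorial(n))
    return scale * x ** n * math.exp(-sigma ** 2 * x ** 2)

def sup_on_grid(n, sigma, a, steps=4000):
    """Grid approximation (from below) of the sup-norm of e_n on [-a, a]."""
    return max(abs(e_n(n, sigma, -a + 2 * a * i / steps)) for i in range(steps + 1))

def kernel_from_onb(sigma, x, y, terms=60):
    """Truncated series sum_n e_n(x) e_n(y), which should approximate k_sigma(x, y)."""
    return sum(e_n(n, sigma, x) * e_n(n, sigma, y) for n in range(terms))

sigma, a = 1.5, 3.0
for n in (1, 5, 20):
    print(n, sup_on_grid(n, sigma, a), (2 * math.pi * n) ** -0.25)

x, y = 0.4, -0.3
print(kernel_from_onb(sigma, x, y), math.exp(-sigma ** 2 * (x - y) ** 2))
```

The second comparison reflects the general fact that a reproducing kernel equals the pointwise sum of products of any ONB of its RKHS.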

Our next goal is to generalize the above result to the multi-dimensional
case. To this end, recall that the tensor product $f \otimes g : X \times X \to \mathbb{R}$ of two
functions $f, g : X \to \mathbb{R}$ is defined by $f \otimes g(x,x') := f(x)g(x')$, $x,x' \in X$. Obviously,
for bounded functions we have $\|f \otimes g\|_\infty = \|f\|_\infty\|g\|_\infty$.

For a multi-index $\eta = (n_1,\dots,n_d) \in \mathbb{N}_0^d$ we use the notation $\eta \ge n$ if $n_i \ge n$
for all $i = 1,\dots,d$. Moreover, we write

(16) $e_\eta := e_{n_1} \otimes \cdots \otimes e_{n_d}, \qquad \eta = (n_1,\dots,n_d) \in \mathbb{N}_0^d,$

where $e_n$ is defined by (12). Then [40], Theorem 5, shows that $(e_\eta)_{\eta\in\mathbb{N}_0^d}$
is an ONB of $H_\sigma(\mathbb{R}^d)$ and the restrictions of the members of this ONB to
$[-a,a]^d$ form an ONB of $H_\sigma([-a,a]^d)$. The following corollary generalizes the
estimates of Lemma 5.6 to this multi-dimensional ONB.

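Corollary 5.7. For $\sigma > 0$ and $a > 0$ satisfying $a\sigma \ge 1$ and $d \in \mathbb{N}$, let
$(e_\eta)_{\eta\in\mathbb{N}_0^d}$ be the restriction of the ONB (16) to $[-a,a]^d$. Then for $n \ge 8ea^2\sigma^2$
we have

$$\Bigl(\sum_{\eta:\,\exists i:\, n_i > n}\|e_\eta\|_\infty^2\Bigr)^{1/2} \le \sqrt{d}\; e^{-a^2\sigma^2}\,(6a\sigma)^{(d-1)/2}\Bigl(\frac{2}{\pi(n+1)}\Bigr)^{1/4} 2^{-(n+1)}.$$

Proof. Using $\|e_{i_1} \otimes \cdots \otimes e_{i_d}\|_\infty = \|e_{i_1}\|_\infty \cdots \|e_{i_d}\|_\infty$ we obtain

$$\sum_{\eta:\,\exists i:\, n_i > n}\|e_\eta\|_\infty^2 \le d\sum_{i_1=n+1}^{\infty}\sum_{i_2=0}^{\infty}\cdots\sum_{i_d=0}^{\infty}\prod_{j=1}^{d}\|e_{i_j}\|_\infty^2$$
$$= d\Bigl(\sum_{i=n+1}^{\infty}\|e_i\|_\infty^2\Bigr)\Bigl(\sum_{i=0}^{\infty}\|e_i\|_\infty^2\Bigr)^{d-1}$$
$$\le d\,\Bigl(\frac{2}{\pi(n+1)}\Bigr)^{1/2} 4^{-(n+1)}\, e^{-2a^2\sigma^2}\,(6a\sigma)^{d-1}$$

by Lemma 5.6. From this we immediately obtain the assertion. □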

5.3. A concentration inequality in RKHSs. In this subsection we will 
establish a concentration inequality for RKHS-valued functions and for pro- 
cesses which have a certain decay of correlations. This concentration result 
will then be the key ingredient in the statistical analysis of the proof of 
Theorem 2.4. 

Let us begin by recalling a simple inequality that will be used several 
times: 

Lemma 5.8. Let $\mathcal{Z} = (Z_i)_{i\ge 0}$ be a second-order stationary $Z$-valued process
on $(\Omega,\mathcal{A},\mu)$. Then for $P := \mu_{Z_0}$, $f \in L_2(P)$, $n \ge 1$ and $\delta > 0$ we have



(17) ^ wGO: 



^ n— 1 



-Y^foZi{u:)-¥.pf 



n . „ 

1=0 



2 



j=0 



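For the following results we have to introduce more notation: Given a
bounded measurable kernel $k : X \times X \to \mathbb{R}$ with RKHS $H$ we write $\Phi : X \to
H$, $\Phi(x) := k(x,\cdot)$ for the canonical feature map. Moreover, for a bounded
measurable function $h : X \times Y \to \mathbb{R}$ and a distribution $P$ on $X \times Y$ we write
$\mathbb{E}_P h\Phi$ for the Bochner integral (see [20]) of the $H$-valued function $h\Phi$. Similarly,
given $T := ((x_1,y_1),\dots,(x_n,y_n)) \in (X \times Y)^n$ we write

(18) $\mathbb{E}_T h\Phi := \dfrac{1}{n}\sum_{i=1}^{n} h(x_i,y_i)\,\Phi(x_i)$

for the empirical counterpart of $\mathbb{E}_P h\Phi$. In order to motivate the following
results we further mention that the proof of Theorem 2.4 will heavily rely
on the estimate

$$\|f_{P,\lambda,H} - f_{T,\lambda,H}\|_H \le \frac{1}{\lambda}\,\|\mathbb{E}_P h_\lambda\Phi - \mathbb{E}_T h_\lambda\Phi\|_H,$$

where $f_{P,\lambda,H}$ is the SVM solution (see Theorem 5.12 for an exact definition)
one obtains by replacing the empirical risk $\mathcal{R}_{L,T}(\cdot)$ with the true risk $\mathcal{R}_{L,P}(\cdot)$
in (5) and $h_\lambda$ is a function independent of the training set $T$. Consequently,
our next goal is to estimate terms of the form $\|\mathbb{E}_P h\Phi - \mathbb{E}_T h\Phi\|_H$. To this
end, we begin with the following lemma which, roughly speaking, will be
used to reduce RKHS-valued functions to $\mathbb{R}$-valued functions.

Lemma 5.9. Let $H$ be the separable RKHS of a bounded measurable
kernel $k : X \times X \to \mathbb{R}$, let $\Phi : X \to H$ be the corresponding canonical feature
map and $(e_i)_{i\ge 0}$ be an ONB of $H$. Moreover, let $Y$ be another measurable
space, $P$ and $Q$ be probability measures on $X \times Y$ and $h \in L_1(P) \cap L_1(Q)$.
Then for all $n \ge 0$ we have

$$\|\mathbb{E}_P h\Phi - \mathbb{E}_Q h\Phi\|_H \le \Bigl(\sum_{i=0}^{n}|\mathbb{E}_P he_i - \mathbb{E}_Q he_i|^2\Bigr)^{1/2} + \Bigl(\sum_{i=n+1}^{\infty}\|e_i\|_\infty^2\Bigr)^{1/2}(\mathbb{E}_P|h| + \mathbb{E}_Q|h|).$$

Proof. Let us define $S_n : H \to H$ by $\sum_{i=0}^{\infty}\langle f,e_i\rangle e_i \mapsto \sum_{i=0}^{n}\langle f,e_i\rangle e_i$. Then
we have

$$\|S_n\Phi(x) - \Phi(x)\|_H^2 = \Bigl\|\sum_{i=n+1}^{\infty}\langle\Phi(x),e_i\rangle e_i\Bigr\|_H^2 = \sum_{i=n+1}^{\infty}|\langle\Phi(x),e_i\rangle|^2 = \sum_{i=n+1}^{\infty} e_i^2(x) \le \sum_{i=n+1}^{\infty}\|e_i\|_\infty^2$$

by the reproducing property and hence we obtain

$$\|\mathbb{E}_P h\Phi - \mathbb{E}_Q h\Phi\|_H \le \|\mathbb{E}_P h\Phi - \mathbb{E}_P hS_n\Phi\|_H + \|\mathbb{E}_P hS_n\Phi - \mathbb{E}_Q hS_n\Phi\|_H + \|\mathbb{E}_Q hS_n\Phi - \mathbb{E}_Q h\Phi\|_H$$
$$\le \mathbb{E}_P|h|\,\|\Phi - S_n\Phi\|_H + \|\mathbb{E}_P hS_n\Phi - \mathbb{E}_Q hS_n\Phi\|_H + \mathbb{E}_Q|h|\,\|\Phi - S_n\Phi\|_H$$
$$\le \|\mathbb{E}_P hS_n\Phi - \mathbb{E}_Q hS_n\Phi\|_H + \Bigl(\sum_{i=n+1}^{\infty}\|e_i\|_\infty^2\Bigr)^{1/2}(\mathbb{E}_P|h| + \mathbb{E}_Q|h|).$$

Moreover, using the reproducing property we have $\langle\mathbb{E}_P h\Phi,e_i\rangle = \mathbb{E}_P he_i$ and
$\langle\mathbb{E}_Q h\Phi,e_i\rangle = \mathbb{E}_Q he_i$, and thus we conclude

$$\|\mathbb{E}_P hS_n\Phi - \mathbb{E}_Q hS_n\Phi\|_H^2 = \sum_{i=0}^{n}|\langle\mathbb{E}_P h\Phi - \mathbb{E}_Q h\Phi, e_i\rangle|^2 = \sum_{i=0}^{n}|\mathbb{E}_P he_i - \mathbb{E}_Q he_i|^2.$$

Combining this equality with the previous estimate, we find the assertion. □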

Before we can establish the concentration inequality for RKHS-valued 
functions, we finally need the following simple lemma. 

Lemma 5.10. For $d \ge 1$ and $t \ge 18\,d\ln(d)$ we have $t^{-1/4}2^{-t} \le t^{-2d}$.

Proof. Obviously, it suffices to show

(19) $t\ln 2 + (1/4 - 2d)\ln t \ge 0.$

Let us first prove the case $d = 1$. Then (19) reduces to the assertion $h(t) :=
t\ln 2 - \frac{7}{4}\ln t \ge 0$. To establish the latter, note that we have $h'(t) = \ln 2 - \frac{7}{4t}$
and hence $h'(t^*) = 0$ holds if and only if $t^* = \frac{7}{4\ln 2}$. Simple considerations
then show that $h$ has its only global minimum at $t^*$ and therefore we have

$$h(t) \ge h(t^*) = \frac{7}{4} - \frac{7}{4}\ln\Bigl(\frac{7}{4\ln 2}\Bigr) > 0.$$

Let us now consider the case $d \ge 2$. To this end we fix a $t \ge 18\,d\ln(d)$.
Then there exists a unique $x \ge 18$ with $t = x\,d\ln(d)$, and hence we obtain

$$t\ln 2 + (1/4 - 2d)\ln t = x\,d\ln(d)\ln 2 + \tfrac{1}{4}\ln(x\,d\ln(d)) - 2d\ln(x\,d\ln(d))$$
$$\ge x\,d\ln(d)\ln 2 - 2d\ln(x\,d\ln(d))$$
$$= d\bigl(x\ln(d)\ln 2 - 2\ln x - 2\ln d - 2\ln(\ln(d))\bigr)$$
$$\ge d\Bigl(x\ln(d)\ln 2 - 2\,\frac{\ln d}{\ln 2}\,\ln x - 2\ln d - 2\ln d\Bigr)$$
$$= d\ln(d)\Bigl(x\ln 2 - \frac{2}{\ln 2}\ln x - 4\Bigr),$$

where in the last estimate we used $d \ge 2$. Now it is elementary to check that
$x \mapsto x\ln 2 - \frac{2}{\ln 2}\ln x - 4$ is increasing on $[2(\ln 2)^{-2},\infty)$ and since $18\ln 2 -
\frac{2}{\ln 2}\ln 18 - 4 > 0$, we then obtain (19). □
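
Inequality (19) is elementary but easy to get wrong, so a quick numerical check can be reassuring. The sketch below evaluates the left-hand side of (19) at the threshold $t = 18\,d\ln(d)$ and at a few larger values; the range of $d$ and the multipliers are arbitrary choices made for illustration.

```python
import math

def log_gap(t, d):
    """Left-hand side of (19): t*ln(2) + (1/4 - 2d)*ln(t), for t > 0."""
    return t * math.log(2) + (0.25 - 2 * d) * math.log(t)

def check_lemma(d_max=8, factors=(1.0, 1.5, 3.0, 10.0)):
    """Return True if (19) holds at t = c * 18 * d * ln(d) for all tested d >= 2 and c."""
    for d in range(2, d_max + 1):
        t0 = 18 * d * math.log(d)
        for c in factors:
            if log_gap(c * t0, d) < 0:
                return False
    return True

print(check_lemma())
# For d = 1 the minimum of h(t) = t*ln(2) - (7/4)*ln(t) sits at t* = 7 / (4 ln 2).
print(log_gap(7 / (4 * math.log(2)), 1))
```

A non-negative value of `log_gap` is equivalent to $t^{-1/4}2^{-t} \le t^{-2d}$ after taking logarithms.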

Theorem 5.11. For a > and a > satisfying aa >1 and d>l, let 
$ : [— a, a]'^ — > i?(j([— a, o]^) be the canonical feature map of the Gaussian 
RBF kernel and let (e^)^gpj(i be the ONB of H^{[—a,a]'^) which is con- 
sidered in Corollary 5.7. In addition, let Y be a measurable space and let 
Z = {Xi,Yi)i>o be a [—a,a]'^ x Y -valued process on {Q,A,fi) that is second- 
order stationary. Furthermore, let {■ji)i>o be a strictly positive sequence con- 
verging to zero, h: [—a,a]'^ x y — > M be a bounded measurable function and 
Kfi G [l,oo) be a constant such that 

(20) COT z,iiher,, hCrj) < Kh-fi 

for all i>0, rj e Nq. Then for all e > satisfying both e < (1 + Sea^cr^)"^'^ 
and e < (ISdlnd)"^'^ and all n>l we have 

niLoGQ: \\Eph<^ - ET„(^^)h<l>\\fj < e) 

2{l + {l/{8ea^a')rK,Cl^,^, -^ 

1=0 

where Ej'^(^)/i$ denotes the empirical Bochner integral (18) with respect to 
the data set Tn{u)) := {Zq{u;),. . . ,Zn-i{uj)), and 

1 \'i/2 

Caa4,h := ( 1 + 



Proof. Let us write 



Caa,d,h 

Using Caa^d,h > (1 + s^)'^''^ > 1 and e < (1 + 8eaV2)-2'^ we then find 
(5 < (1 + Sea^iT^)"^'^/^, and consequently, there exists a natural number m > 
Sea^o"^ such that (m + l)~5'^/2 <5< m"^'^/^. Moreover, for later use we note 
that using Caa,d,h > 1 and e < (ISdlnd)-^'^ yields 6'^^^'^ > ISdlnd. Let us 
now consider an w G such that 

(21) \Epherj - Er„(a;)/ie^| < S 



Sea'^a^, 
+ 2Vde-'^'"'(6aa)(^-i)/2||;j|| 



5/4 

6:-- 
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for all rj G {0, . . . ,m}'^. By Lemma 5.9 and Corollary 5.7 we then obtain 



\r?<m 



\ 1/2 
i2 \ 



7r(m + 1) 
V Sealer"' / 



1 xrf/s 



I OO ) 



where in the last step we used the inequalities Sea^cr^ <m < 6 ^^'^^ < m + 1. 
Using Lemma 5.10 for t := we consequently obtain 



||Ep/i^>-Er„(^)/i$||j^ 

1 \d/2 

Moreover, by Lemma 5.8 and a simple union bound argument we see that 
the probability of a; satisfying (21) for all rj G {0, . . . simultaneously is 
not smaller than 

2 

^~ E E coj:z,i{her,,herj). 

r)e{0,...,m}<* «=0 

In addition, we have 

r)G{0,...,m}'* «=0 i=0 



and since Sea^cr^ < ?ti < 5 2/(5d) further estimate 
^ 2(l + l/(8eaV2))'^m^ ^ 



2(?n + l)'' 2(l + l/(8eaV2))'^m^ 2(1 + l/(8ea2cj2))'^ 



2(l + l/(8eaV2))'^C3 



Combining these estimates we then obtain the assertion. □ 
5.4. Proof of Theorem 2.4. For the proof of Theorem 2.4 we need some
final preparations. Let us begin with the following result on the existence
and uniqueness of infinite sample SVMs which is a slight extension of similar
results established in [12, 18]:



Theorem 5.12. Let $L : X \times Y \times \mathbb{R} \to [0,\infty)$ be a convex, locally Lipschitz
continuous loss function satisfying $L(x,y,0) \le 1$ for all $(x,y) \in X \times Y$,
and let $P$ be a distribution on $X \times Y$. Furthermore, let $H$ be a RKHS of a
bounded measurable kernel over $X$. Then for all $\lambda > 0$ there exists exactly
one element $f_{P,\lambda,H} \in H$ such that

(22) $\lambda\|f_{P,\lambda,H}\|_H^2 + \mathcal{R}_{L,P}(f_{P,\lambda,H}) = \inf_{f\in H}\,\lambda\|f\|_H^2 + \mathcal{R}_{L,P}(f).$

Furthermore, we have $\|f_{P,\lambda,H}\|_H \le \lambda^{-1/2}$.

Note that the above theorem in particular yields $\|f_{T,\lambda,H}\|_H \le \lambda^{-1/2}$ by
considering the empirical measure associated to a training set $T \in (X \times Y)^n$.
The following result which was (essentially) shown in [12, 18] describes the
stability of the empirical SVM solutions.



Theorem 5.13. Let $X$ be a separable metric space, $L : X \times Y \times \mathbb{R} \to
[0,\infty)$ be a convex, locally Lipschitz continuous loss function satisfying
$L(x,y,0) \le 1$ for all $(x,y) \in X \times Y$ and let $P$ be a distribution on $X \times Y$.
Furthermore, let $H$ be the RKHS of a bounded continuous kernel $k : X \times X \to
\mathbb{R}$ and let $\Phi : X \to H$ be the corresponding canonical feature map. Then for
all $\lambda > 0$ the function $h_\lambda : X \times Y \to \mathbb{R}$ defined by

(23) $h_\lambda(x,y) := L'(x,y,f_{P,\lambda,H}(x)), \qquad (x,y) \in X \times Y,$
is bounded and satisfies and 

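(24) $\|f_{P,\lambda,H} - f_{T,\lambda,H}\|_H \le \dfrac{1}{\lambda}\,\|\mathbb{E}_P h_\lambda\Phi - \mathbb{E}_T h_\lambda\Phi\|_H, \qquad T \in (X \times Y)^n.$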

Proof of Theorem 2.4. Obviously, it suffices to consider sets X of 
the form X = [—a,a]'^ for some a > 1. For a > and A > we write hx^a 
for the function we obtain by Theorem 5.13 for H := H„{X). By the local 
Lipschitz continuity of L, ||fccr||oo < 1, Theorem 5.12 and (24) we then have 

\nL,p{fT.X,a)-nL,p{fP,X,'r)\ 

(25) 

- ^'x''^ \\^Phx,a^ - EThx,anH^{X) 
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for all cr > 0, A > and all T e {X x y)". Moreover, using (23), (6) and 
Lemma 5.5 we have 

\h\,a{x,y) - hx,a{x,y')\ 

= \L'{x,y,fp^x,a{x)) - L'{x' ,y' ,fpxa{x'))\ 

<c-(\x- xf + \y- y'f + \fp,xA^) - /p,A,.(x')P)'/' 

<c-{\x- xf + |y - 2/f + 2a2||/p,,,,||2^^(^)|x - xffi'' 

<2ca\-^l''\\{x,y)-{x\y')h 

for all fj > 1, A G (0, 1] and all {x',y') e X xY. Consequently, we find 

|^a,<t|i < 2ccrA-i/2_ Moreover, we have 

(26) \hxAx,y)\ = \L'ix,y,fp^xA^))\< sup |L'(x, y, t)| < |L|^_V2 i 

|t|<A-i/2 

for all A > and all {x,y) X xY. Let us now write el^^ for the 77th ele- 
ment, 77 S Ng, of the ONB of Ha-{X) considered in Corollary 5.7. Combin- 
ing the above estimates with Lemma 5.5 and the trivial bound ||eif^^||oo < 
Ikr;'^^ ||_f/CT(X) < 1 we obtain 

|/iA,aeM|i < |/iA,a|i||eM||^ + ||/iA,a||oo|e('^)|i < ScaA-i/^ 

for all A G (0,1] and cr > 1, where in the last step we used the estimate 
IMa,! < c(l + a), a > 0, which we derived after Assumption L. Since we fur- 
ther have ll/iA.aeJf^lloo < ||/iA,a||oo < 2cA~^/2^ ||/iA,ae^r^||Lip(Xxy) < 
5ca"A~^/^. Moreover, by Corollary 5.3 we may assume without loss of gener- 
ality that K^^^ is of the form k^^^ = cz\\tp\\up{XxY)\\v^\\up{XxY), where cz 
is a constant only depending on Z and (74). Consequently, we obtain 



cor^. 



.(/iA,.eM,/iA,<xeM)| < 25czc^X-^a\ 



for all (7 > 1, AG (0,1], and ij £ Nq, that is, (20) is satisfied for K/^^^^ := 
cX~^a^, where cr > 1, A G (0, 1], and c is a constant independent of A and a. 
For n>l and e > satisfying both 

(27) e < (1 + 8eaV2)-2rf|L|^_i/2^^A-^ 

and e < (ISdlnd)"^"^, Theorem 5.11 together with (25) and (26) thus yields 

H{uj£n: |'7^L,p(/r„H,A.a) - T^L,p{fPX<r)\ > e) 

(28) ^ _ 

2c(l + l/{^ea^a^)YCl^^,^,\L\l_,,,^^a^ 

2=0 
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where (5A,<x,d,a := (1 + 8^)'^/2 + 2Vde-"'-'(6af7)('^-i)/2|L|;^-i/2 Using the 
fact that the function x i— > +x^{d-i)/2 jg bounded on [0, oo) we further 
obtain 

(29) (5A„,.„,d,a < Cd{l + e-'^'^"|L| ^), 

where Cd is a constant only depending on d. Let us now consider the case 
where Assumption SI is fulfiUed. Then we have Cd,a '■= sup„>i C\^^a„,d,a < oo 
and 

£0 := inf (1 + 8eaV2)-2rf|L| A^^ > 0, 

n>l ■^n :^ 

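and hence (27) is satisfied for all $\varepsilon \in (0,\varepsilon_0]$. Moreover, by the remark after Assumption
L we have $|L|_{\lambda^{-1/2},1} \le c(1 + \lambda^{-1/2})$ for all $\lambda > 0$ and hence the first
and third assumption of S1 together with $\sigma_n \le \sigma_{n+1}$ imply $\lim_{n\to\infty}\lambda_n\sigma_n^d = 0$.
By Lemma 5.4 we thus find $\lim_{n\to\infty}\mathcal{R}_{L,P}(f_{P,\lambda_n,\sigma_n}) = \mathcal{R}^*_{L,P}$. Consequently,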

(28) shows that for sufficiently large n and e G (0,eo] we have 

\L\^-i/2 "-^ 

^(o; G J7 : \nL,p{fTA<^)M,^J -n,p\ > 2e) < 2('^+i)/^cC|, ^ ^ 7i, 

and hence we obtain the assertion by the last condition of Assumption SI. 

Let us now consider the case where Assumption S2 is fulfilled. Then it is 
easy to see that the second assumption of S2 implies lim^^oo o'~'^'^\L\ 1/2 = 

0, which in turn yields sup„>i e~'^"|L| 1/2 < 00. From this we conclude 

— A„ ,1 

a '. — sup„>]^ C\^^ (j^ (i,a ^ 00 by (29). Moreover, a simple consideration shows 
we have := (1 + ^eo?a'i)~'^'^\L\ 1/2 , 0. For a fixed e > we thus 

have £n ^ £ for all sufficiently large n. Therefore we find 

fi{u;Gn: \nLAfTAu)M,^J " ^2,pI > 2e) 

n n j_o " 4=0 

for all sufficiently large n, where Cd,a is a constant only depending on d 
and a. From this estimate we obtain the assertion by the last condition of 
Assumption S2. 

Finally, let us consider the case where Assumption S3 is satisfied. Using 

(29) and a > 1 we then obtain for sufficiently large n and e G (0,eo] that 

/.(o; G : |7^i,p(/T„(.),An,.J " n,p\ > 2s) < Cd ^ 7. , 

where (7^ is a constant only depending on d. □ 
6. Proof of Theorem 3.1. For the proof of Theorem 3.1 we need to bound
the correlation sequences for stochastic processes which are the sum of a 
dynamical system and an observational noise process. This is the goal of the 
following results. We begin with a lemma which computes the correlation of 
a joint process from the correlations of its components. 

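Lemma 6.1. Let $\mathcal{X} = (X_i)_{i\ge 0}$ be an $X$-valued, identically distributed
process defined on $(\Omega,\mathcal{A},\mu)$ and $\mathcal{Y} = (Y_i)_{i\ge 0}$ be a $Y$-valued, identically distributed
stochastic process defined on $(\Theta,\mathcal{B},\nu)$. Then the process $\mathcal{Z} = (Z_i)_{i\ge 0}$
defined on $(\Omega \times \Theta, \mathcal{A} \otimes \mathcal{B}, \mu \otimes \nu)$ by $Z_i := (X_i,Y_i)$ is identically distributed
with $P := (\mu \otimes \nu)_{Z_0} = \mu_{X_0} \otimes \nu_{Y_0}$. Moreover, for $\psi,\varphi \in L_2(P)$ we have

$$\mathrm{cor}_{\mathcal{Z},i}(\psi,\varphi) = \mathbb{E}_\nu\,\mathrm{cor}_{\mathcal{X},i}(\psi(\cdot,Y_0),\varphi(\cdot,Y_i)) + \mathbb{E}_\mu\mathbb{E}_{\mu'}\,\mathrm{cor}_{\mathcal{Y},i}(\psi(X_0,\cdot),\varphi(X_0',\cdot)),$$

where $X_0'$ is an independent copy of $X_0$.

Proof. The first assertion regarding $P$ is obvious. For the second assertion
we fix an independent copy $\mathcal{X}' = (X_i')_{i\ge 0}$ of $\mathcal{X}$. Then an easy calculation
using the fact that both $\mathcal{X}$ and $\mathcal{Y}$ are identically distributed yields

$$\mathrm{cor}_{\mathcal{Z},i}(\psi,\varphi)$$
$$= \mathbb{E}_\mu\mathbb{E}_\nu\psi(X_0,Y_0)\varphi(X_i,Y_i) - \mathbb{E}_\mu\mathbb{E}_\nu\psi(X_0,Y_0) \cdot \mathbb{E}_\mu\mathbb{E}_\nu\varphi(X_0,Y_0)$$
$$= \mathbb{E}_\mu\mathbb{E}_\nu\psi(X_0,Y_0)\varphi(X_i,Y_i) - \mathbb{E}_\mu\mathbb{E}_{\mu'}\mathbb{E}_\nu\psi(X_0,Y_0)\varphi(X_0',Y_i)$$
$$\quad + \mathbb{E}_\mu\mathbb{E}_{\mu'}\mathbb{E}_\nu\psi(X_0,Y_0)\varphi(X_0',Y_i) - \mathbb{E}_\mu\mathbb{E}_\nu\psi(X_0,Y_0) \cdot \mathbb{E}_\mu\mathbb{E}_\nu\varphi(X_0,Y_0)$$
$$= \mathbb{E}_\nu\bigl(\mathbb{E}_\mu\psi(X_0,Y_0)\varphi(X_i,Y_i) - \mathbb{E}_\mu\mathbb{E}_{\mu'}\psi(X_0,Y_0)\varphi(X_0',Y_i)\bigr)$$
$$\quad + \mathbb{E}_\mu\mathbb{E}_{\mu'}\mathbb{E}_\nu\psi(X_0,Y_0)\varphi(X_0',Y_i) - \mathbb{E}_\mu\mathbb{E}_{\mu'}\bigl(\mathbb{E}_\nu\psi(X_0,Y_0) \cdot \mathbb{E}_\nu\varphi(X_0',Y_0)\bigr)$$
$$= \mathbb{E}_\nu\bigl(\mathbb{E}_\mu\psi(X_0,Y_0)\varphi(X_i,Y_i) - \mathbb{E}_\mu\psi(X_0,Y_0) \cdot \mathbb{E}_\mu\varphi(X_0,Y_i)\bigr)$$
$$\quad + \mathbb{E}_\mu\mathbb{E}_{\mu'}\bigl(\mathbb{E}_\nu\psi(X_0,Y_0)\varphi(X_0',Y_i) - \mathbb{E}_\nu\psi(X_0,Y_0) \cdot \mathbb{E}_\nu\varphi(X_0',Y_0)\bigr)$$
$$= \mathbb{E}_\nu\,\mathrm{cor}_{\mathcal{X},i}(\psi(\cdot,Y_0),\varphi(\cdot,Y_i)) + \mathbb{E}_\mu\mathbb{E}_{\mu'}\,\mathrm{cor}_{\mathcal{Y},i}(\psi(X_0,\cdot),\varphi(X_0',\cdot)),$$

that is, we have proved the assertion. □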

The following elementary lemma establishes the Lipschitz continuity of 
a certain type of function which is important when considering the process 
that generates noisy observations of a dynamical system. 

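Lemma 6.2. Let $M\subset\mathbb{R}^d$ be a compact subset and $F: M\to M$ be a Lipschitz continuous map. For $B>0$ and a fixed $j\in\{1,\dots,d\}$ we write $X := M+[-B,B]^d$, $Y := \pi_j(X)$ and $Z := X\times Y$, where $\pi_j:\mathbb{R}^d\to\mathbb{R}$ denotes the $j$th coordinate projection. For $h\in\mathrm{Lip}(Z)$ and $x\in M$, $\varepsilon_0,\varepsilon_1\in[-B,B]^d$ we define the function $\bar h: M\times[-B,B]^{2d}\to\mathbb{R}$ by

$$\bar h(x,\varepsilon_0,\varepsilon_1) := h\bigl(x+\varepsilon_0,\pi_j(F(x)+\varepsilon_1)\bigr). \tag{30}$$

Then for all $x\in M$ and $\varepsilon_0,\varepsilon_1\in[-B,B]^d$ we have

$$\|\bar h(x,\cdot,\cdot)\|_{\mathrm{Lip}([-B,B]^{2d})} \le \bigl(1+\|F\|_{\mathrm{Lip}(M)}\bigr)\|h\|_{\mathrm{Lip}(Z)},$$

$$\|\bar h(\cdot,\varepsilon_0,\varepsilon_1)\|_{\mathrm{Lip}(M)} \le \|h\|_{\mathrm{Lip}(Z)}.$$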

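Proof. For $(\varepsilon_0,\varepsilon_1),(\varepsilon_0',\varepsilon_1')\in[-B,B]^d\times[-B,B]^d$ we obviously have

$$\begin{aligned}
&\bigl|h\bigl(x+\varepsilon_0,\pi_j(F(x)+\varepsilon_1)\bigr) - h\bigl(x+\varepsilon_0',\pi_j(F(x)+\varepsilon_1')\bigr)\bigr| \\
&\qquad\le \|h\|_{\mathrm{Lip}(Z)}\bigl(\|\varepsilon_0-\varepsilon_0'\|_2^2 + |\pi_j(F(x)+\varepsilon_1) - \pi_j(F(x)+\varepsilon_1')|^2\bigr)^{1/2} \\
&\qquad\le \|h\|_{\mathrm{Lip}(Z)}\,\|(\varepsilon_0,\varepsilon_1)-(\varepsilon_0',\varepsilon_1')\|_2 .
\end{aligned}$$

Analogously, for $x,x'\in M$ we have

$$\begin{aligned}
&\bigl|h\bigl(x+\varepsilon_0,\pi_j(F(x)+\varepsilon_1)\bigr) - h\bigl(x'+\varepsilon_0,\pi_j(F(x')+\varepsilon_1)\bigr)\bigr| \\
&\qquad\le \|h\|_{\mathrm{Lip}(Z)}\bigl(\|x-x'\|_2^2 + |\pi_j(F(x)+\varepsilon_1) - \pi_j(F(x')+\varepsilon_1)|^2\bigr)^{1/2} \\
&\qquad\le \|h\|_{\mathrm{Lip}(Z)}\bigl(1+\|F\|_{\mathrm{Lip}(M)}\bigr)\|x-x'\|_2 .
\end{aligned}$$

From these estimates we easily obtain the assertions. □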
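The two difference-quotient estimates in the proof above can be probed numerically. The sketch below assumes d = 1 (so the coordinate projection is the identity), uses the logistic map on M = [0, 1] as a stand-in for F, and picks a hypothetical smooth test function h. It only compares Lipschitz constants of difference quotients, not the full Lip-norm, and every concrete choice in it is an assumption made for illustration.

```python
import numpy as np

B = 0.1        # assumed noise bound
LIP_F = 4.0    # Lipschitz constant of the logistic map on [0, 1]

def F(x):
    # Logistic map on M = [0, 1]; |F'(x)| = |4 - 8x| <= 4.
    return 4.0 * x * (1.0 - x)

def h(u, v):
    # Hypothetical test function on Z = X x Y. Its gradient is
    # (cos u, -sin v), so its Euclidean Lipschitz constant is at most sqrt(2).
    return np.sin(u) + np.cos(v)

LIP_H = np.sqrt(2.0)

def h_bar(x, e0, e1):
    # Equation (30) with d = 1.
    return h(x + e0, F(x) + e1)

rng = np.random.default_rng(0)
n = 10_000
x, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
e0, e1 = rng.uniform(-B, B, n), rng.uniform(-B, B, n)
f0, f1 = rng.uniform(-B, B, n), rng.uniform(-B, B, n)

# Difference quotients in the noise variables, with the state x held fixed.
noise_ratio = (np.abs(h_bar(x, e0, e1) - h_bar(x, f0, f1))
               / np.hypot(e0 - f0, e1 - f1))
# Difference quotients in the state variable, with the noise held fixed.
state_ratio = (np.abs(h_bar(x, e0, e1) - h_bar(x2, e0, e1))
               / np.abs(x - x2))

print(noise_ratio.max() <= LIP_H)
print(state_ratio.max() <= LIP_H * (1.0 + LIP_F))
```

Random sampling can only fail to find a violation; it does not replace the proof.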

The next theorem bounds the correlation for functions defined by (30). 

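Theorem 6.3. Let $M\subset\mathbb{R}^d$ be compact and $F: M\to M$ be Lipschitz continuous such that the dynamical system $\mathcal{X} := (F^i)_{i\ge 0}$ has an ergodic measure $\mu$. Moreover, let $\gamma = (\gamma_i)_{i\ge 0}$ be a strictly positive sequence converging to zero such that

$$\mathrm{cor}_{\mathcal{X}}(\psi,\varphi)\in\Lambda(\gamma),\qquad \psi,\varphi\in\mathrm{Lip}(M). \tag{31}$$

Furthermore, let $\mathcal{E} = (\varepsilon_i)_{i\ge 0}$ be a second-order stationary, $[-B,B]^d$-valued process on $(\Theta,\mathcal{B},\nu)$ such that the $[-B,B]^{2d}$-valued process $\mathcal{Y} = (Y_i)_{i\ge 0}$ on $(\Theta,\mathcal{B},\nu)$ that is defined by $Y_i(\vartheta) = (\varepsilon_i(\vartheta),\varepsilon_{i+1}(\vartheta))$, $i\ge 0$, $\vartheta\in\Theta$, satisfies

$$\mathrm{cor}_{\mathcal{Y}}(\psi,\varphi)\in\Lambda(\gamma),\qquad \psi,\varphi\in\mathrm{Lip}([-B,B]^{2d}). \tag{32}$$

For a fixed $j\in\{1,\dots,d\}$ we write $X := M+[-B,B]^d$, $Y := \pi_j(X)$, and $Z := X\times Y$. Define the process $\bar{\mathcal{Z}} = (\bar Z_i)_{i\ge 0}$ on $(\Omega\times\Theta,\mathcal{A}\otimes\mathcal{B},\mu\otimes\nu)$ by $\bar Z_i = (F^i,\varepsilon_i,\varepsilon_{i+1})$, $i\ge 0$. Then for all $\psi,\varphi\in\mathrm{Lip}(Z)$ we have

$$\mathrm{cor}_{\bar{\mathcal{Z}}}(\bar\psi,\bar\varphi)\in\Lambda(\gamma),$$

where $\bar\psi$ and $\bar\varphi$ are defined by (30).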
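Proof. Let $c_{\mathcal{X}}$ and $c_{\mathcal{Y}}$ be the constants we obtain by applying (31) and (32) to Corollary 5.3. Moreover, since $\mathcal{E}$ is second-order stationary, we observe that $\mathcal{Y}$ is identically distributed. Applying Lemma 6.1 to the processes $\mathcal{X}$ and $\mathcal{Y}$ then yields

$$\begin{aligned}
|\mathrm{cor}_{\bar{\mathcal{Z}},i}(\bar\psi,\bar\varphi)|
&\le \bigl|\mathbb{E}_\nu\,\mathrm{cor}_{\mathcal{X},i}\bigl(\bar\psi(\cdot,Y_0),\bar\varphi(\cdot,Y_i)\bigr)\bigr| \\
&\quad + \bigl|\mathbb{E}_{x\sim\mu}\mathbb{E}_{x'\sim\mu}\,\mathrm{cor}_{\mathcal{Y},i}\bigl(\bar\psi(F^0(x),\cdot),\bar\varphi(F^0(x'),\cdot)\bigr)\bigr| \\
&\le c_{\mathcal{X}}\,\mathbb{E}_\nu\,\|\bar\psi(\cdot,\varepsilon_0,\varepsilon_1)\|_{\mathrm{Lip}(M)}\,\|\bar\varphi(\cdot,\varepsilon_i,\varepsilon_{i+1})\|_{\mathrm{Lip}(M)}\cdot\gamma_i \\
&\quad + c_{\mathcal{Y}}\,\mathbb{E}_{x\sim\mu}\mathbb{E}_{x'\sim\mu}\,\|\bar\psi(x,\cdot)\|_{\mathrm{Lip}([-B,B]^{2d})}\,\|\bar\varphi(x',\cdot)\|_{\mathrm{Lip}([-B,B]^{2d})}\cdot\gamma_i \\
&\le c_{\mathcal{X}}\,\|\psi\|_{\mathrm{Lip}(Z)}\|\varphi\|_{\mathrm{Lip}(Z)}\cdot\gamma_i \\
&\quad + c_{\mathcal{Y}}\bigl(1+\|F\|_{\mathrm{Lip}(M)}\bigr)^2\|\psi\|_{\mathrm{Lip}(Z)}\|\varphi\|_{\mathrm{Lip}(Z)}\cdot\gamma_i ,
\end{aligned}$$

where in the last step we used Lemma 6.2 for each of the two functions. □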

Note that for using the estimate of Theorem 6.3 in Lemma 5.8 it is necessary that the process $\mathcal{Y}$ be second-order stationary. Obviously, the latter is satisfied if the process $\mathcal{E}$ is stationary.
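To make the process built from the triples (F^i, ε_i, ε_{i+1}) and the functions defined by (30) more concrete, the following sketch estimates a correlation sequence from a single simulated trajectory. It assumes d = 1, the logistic map as a stand-in dynamical system, i.i.d. uniform noise on [-B, B] (which is bounded and stationary), and a hypothetical test function h. Time averages replace the expectations, which presumes that the chosen starting point behaves typically; no claim is made that this example satisfies assumptions (31) and (32).

```python
import numpy as np

def F(x):
    # Logistic map on M = [0, 1], used only as an illustrative system.
    return 4.0 * x * (1.0 - x)

def h(u, v):
    # Hypothetical Lipschitz test function on Z = X x Y.
    return np.sin(u) + np.cos(v)

def simulate(n, B=0.1, x0=0.3, seed=0):
    """Return the true states F^i(x0), i < n, and n + 1 noise variables."""
    rng = np.random.default_rng(seed)
    states = np.empty(n)
    states[0] = x0
    for i in range(1, n):
        states[i] = F(states[i - 1])
    eps = rng.uniform(-B, B, n + 1)
    return states, eps

def empirical_cor(n=20_000, max_lag=10):
    """Time-average estimate of cor_i(h_bar, h_bar) for lags 0..max_lag."""
    states, eps = simulate(n)
    # Values of h_bar along the triples (F^i(x0), eps_i, eps_{i+1}).
    v = h(states + eps[:-1], F(states) + eps[1:])
    mean = v.mean()
    return np.array([np.mean(v[: n - i] * v[i:]) - mean ** 2
                     for i in range(max_lag + 1)])

if __name__ == "__main__":
    for lag, c in enumerate(empirical_cor()):
        print(lag, c)
```

The lag-0 entry is the empirical variance of the observed values; how quickly the later entries shrink depends on the system and the noise, which is what assumptions (31) and (32) quantify.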

Proof of Theorem 3.1. For a fixed $j\in\{1,\dots,d\}$ we write $X := M+[-B,B]^d$ and $Y := \pi_j(X)$. Moreover, we define the $X\times Y$-valued process $\mathcal{Z} = (X_i,Y_i)_{i\ge 0}$ on $(M\times[-B,B]^{d\mathbb{N}},\mu\otimes\nu)$ by $X_i := F^i + \pi_0\circ S^i$ and $Y_i := \pi_j(F^{i+1} + \pi_0\circ S^{i+1})$, and in addition, we write $P^{(j)} := (\mu\otimes\nu)_{(X_0,Y_0)}$. Let us further consider the $M\times[-B,B]^{2d}$-valued stationary process $\bar{\mathcal{Z}} := (F^i,\pi_0\circ S^i,\pi_0\circ S^{i+1})_{i\ge 0}$ which is defined on $(M\times[-B,B]^{d\mathbb{N}},\mu\otimes\nu)$. For $\psi,\varphi\in\mathrm{Lip}(X\times Y)$, Theorem 6.3 together with our decay of correlations assumptions then shows $|\mathrm{cor}_{\bar{\mathcal{Z}},i}(\bar\psi,\bar\varphi)|\le\kappa_{\psi,\varphi}\gamma_i$ for all $i\ge 0$, where $\kappa_{\psi,\varphi}\in[0,\infty)$ is a constant independent of $i$. Moreover, our construction ensures $\mathrm{cor}_{\mathcal{Z},i}(\psi,\varphi) = \mathrm{cor}_{\bar{\mathcal{Z}},i}(\bar\psi,\bar\varphi)$ for all $i\ge 0$ and hence Theorem 2.4 yields

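$$\mu\otimes\nu\Bigl(\bigl\{(x,\varepsilon)\in M\times[-B,B]^{d\mathbb{N}} : \bigl|\mathcal{R}_{L,P^{(j)}}\bigl(f_{T_n(x,\varepsilon),\lambda_n}\bigr) - \mathcal{R}^*_{L,P^{(j)}}\bigr| > \epsilon\bigr\}\Bigr)\to 0$$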

for $n\to\infty$ and all $\epsilon>0$. Using Assumption LD and the definition (10) we then easily obtain the assertion. □



APPENDIX: PROOF OF THEOREM 5.1 

In the following, $B_E$ denotes the closed unit ball of a Banach space $E$. Recall that a linear operator $S: E\to F$ acting between two Banach spaces $E$ and $F$ is continuous if and only if it is bounded, that is, $\|S\| := \sup_{x\in B_E}\|Sx\| < \infty$. Our first goal is to recall another equivalent condition which in practice is often easier to check. To this end, we need the following definition:

Definition A.1. Let $E$ and $F$ be Banach spaces and $S: E\to F$ be a linear map. Then $S$ is said to have a closed graph if for all $x\in E$, $y\in F$ and all sequences $(x_n)\subset E$ satisfying $x_n\to x$ and $Sx_n\to y$ we have $Sx = y$.

Obviously, every continuous linear operator has a closed graph. The fol- 
lowing fundamental result from functional analysis known as the closed graph 
theorem shows the converse implication. 

Theorem A.2. Let $E$ and $F$ be Banach spaces and $S: E\to F$ be a linear map that has a closed graph. Then $S$ is continuous.

Our next goal is to establish an analogous result for bilinear maps. To this end, we first recall the principle of uniform boundedness, which is also known as the Banach-Steinhaus theorem.

Theorem A.3. Let $E$ and $F$ be Banach spaces, $A$ be a nonempty set, and $S_\alpha: E\to F$, $\alpha\in A$, be bounded linear operators such that

$$\sup_{\alpha\in A}\|S_\alpha x\| < \infty$$

for all $x\in E$. Then we even have $\sup_{\alpha\in A}\sup_{x\in B_E}\|S_\alpha x\| < \infty$.

Let us now recall that a map $S: E_1\times E_2\to F$ between Banach spaces $E_1$, $E_2$ and $F$ is called bilinear if the maps $S(x_1,\cdot): E_2\to F$ and $S(\cdot,x_2): E_1\to F$ are linear for all $x_1\in E_1$ and $x_2\in E_2$. In order to state a closed graph theorem for bilinear maps we also need a notion which describes a closed graph property for bilinear maps:

Definition A.4. Let $E_1$, $E_2$ and $F$ be Banach spaces and $S: E_1\times E_2\to F$ be a bilinear map. Then $S$ is said to have a partially closed graph if the linear maps $S(x_1,\cdot): E_2\to F$ and $S(\cdot,x_2): E_1\to F$ have closed graphs for all $x_1\in E_1$ and $x_2\in E_2$.

With these preparations we can now state and prove the announced closed 
graph theorem for bilinear maps: 

Theorem A.5. Let $E_1$, $E_2$ and $F$ be Banach spaces and $S: E_1\times E_2\to F$ be a bilinear map that has a partially closed graph. Then there exists a constant $c\in[0,\infty)$ such that

$$\|S(x_1,x_2)\|_F \le c\,\|x_1\|_{E_1}\cdot\|x_2\|_{E_2},\qquad x_1\in E_1,\ x_2\in E_2.$$
Proof. By the closed graph theorem the maps $S(x_1,\cdot): E_2\to F$ and $S(\cdot,x_2): E_1\to F$ are bounded linear operators for all $x_1\in E_1$ and $x_2\in E_2$. In particular, the boundedness of the operators $S(\cdot,x_2): E_1\to F$ yields

$$\sup_{x_1\in B_{E_1}}\|S(x_1,x_2)\| < \infty,\qquad x_2\in E_2.$$

Applying the principle of uniform boundedness to the family of bounded operators $(S(x_1,\cdot))_{x_1\in B_{E_1}}$ thus shows

$$c := \sup_{x_1\in B_{E_1}}\sup_{x_2\in B_{E_2}}\|S(x_1,x_2)\| < \infty.$$

Using the bilinearity of $S$ we then obtain the assertion. □
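The role of the constant c in Theorem A.5 can be illustrated in finite dimensions, where every bilinear map is automatically bounded and the double supremum over the unit balls is explicit. The sketch below assumes Euclidean spaces of dimensions 4 and 3 and a hypothetical bilinear form given by a random matrix A; for this form the double supremum equals the spectral norm of A. None of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # hypothetical matrix defining S

def S(x1, x2):
    # Bilinear form S(x1, x2) = x1^T A x2 on R^4 x R^3.
    return x1 @ A @ x2

# c = sup over both Euclidean unit balls of |S(x1, x2)|,
# which for this S is the largest singular value of A.
c = np.linalg.norm(A, 2)

x1s = rng.normal(size=(1000, 4))
x2s = rng.normal(size=(1000, 3))
values = np.abs(np.einsum('ni,ij,nj->n', x1s, A, x2s))
bounds = c * np.linalg.norm(x1s, axis=1) * np.linalg.norm(x2s, axis=1)

print(np.all(values <= bounds + 1e-12))
```

Rescaling arbitrary vectors to the unit balls is exactly the bilinearity step used at the end of the proof; the infinite-dimensional content of the theorem, namely that c is finite at all, has no analogue in this toy setting.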

With these preparations we can now present the proof of Theorem 5.1. 

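Proof of Theorem 5.1. Obviously, $\mathrm{cor}_{\mathcal{Z}}: E_1\times E_2\to F$ is a well-defined bilinear operator. In view of Theorem A.5 it suffices to show that this operator has a partially closed graph. We begin by showing that $\mathrm{cor}_{\mathcal{Z}}(\psi,\cdot): E_2\to F$ has a closed graph for all $\psi\in E_1$. To this end let us fix some $\psi\in E_1$, $\varphi\in E_2$, a sequence $b := (b_n)_{n\ge 0}\in F$ and a sequence $(\varphi_i)_{i\ge 1}\subset E_2$ such that $\lim_{i\to\infty}\|\varphi_i-\varphi\|_{E_2} = 0$ and

$$\lim_{i\to\infty}\|\mathrm{cor}_{\mathcal{Z}}(\psi,\varphi_i) - b\|_F = 0. \tag{33}$$

Obviously, $\mathrm{cor}_{\mathcal{Z}}(\psi,\cdot): E_2\to F$ has a closed graph if $\mathrm{cor}_{\mathcal{Z}}(\psi,\varphi) = b$. To show this equality we first observe that for fixed $n\ge 0$ and $i\to\infty$ we have

$$\Bigl|\int\varphi\,dP - \int\varphi_i\,dP\Bigr| \le \|\varphi-\varphi_i\|_{L_1(P)} \le \|\mathrm{id}: E_2\to L_2(P)\|\cdot\|\varphi-\varphi_i\|_{E_2} \to 0$$

and

$$\begin{aligned}
\Bigl|\int_\Omega\psi(Z_0)\cdot\varphi(Z_n)\,d\mu - \int_\Omega\psi(Z_0)\cdot\varphi_i(Z_n)\,d\mu\Bigr|
&\le \|\psi\|_{L_2(P)}\cdot\|\varphi-\varphi_i\|_{L_2(P)} \\
&\le \|\psi\|_{L_2(P)}\cdot\|\mathrm{id}: E_2\to L_2(P)\|\cdot\|\varphi-\varphi_i\|_{E_2} \to 0.
\end{aligned}$$

From this we conclude $\lim_{i\to\infty}\mathrm{cor}_{\mathcal{Z},n}(\psi,\varphi_i) = \mathrm{cor}_{\mathcal{Z},n}(\psi,\varphi)$ for the $n$th coordinate of the sequences of correlations. Moreover, $F$ is continuously included in $\ell_\infty$ and hence (33) implies $\lim_{i\to\infty}\mathrm{cor}_{\mathcal{Z},n}(\psi,\varphi_i) = b_n$ for all $n\ge 0$. Combining these considerations yields $\mathrm{cor}_{\mathcal{Z},n}(\psi,\varphi) = b_n$ for all $n\ge 0$, that is, we have shown that $\mathrm{cor}_{\mathcal{Z}}(\psi,\cdot): E_2\to F$ has a closed graph. Since the fact that all $\mathrm{cor}_{\mathcal{Z}}(\cdot,\varphi): E_1\to F$ have a closed graph can be shown completely analogously, the proof is complete. □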